=Paper=
{{Paper
|id=None
|storemode=property
|title=A User-centric Evaluation of Recommender Algorithms for an Event Recommendation System
|pdfUrl=https://ceur-ws.org/Vol-811/paper10.pdf
|volume=Vol-811
}}
==A User-centric Evaluation of Recommender Algorithms for an Event Recommendation System==
Simon Dooms, Toon De Pessemier, Luc Martens
Wica-INTEC, IBBT-Ghent University, G. Crommenlaan 8 box 201, B-9050 Ghent, Belgium
Simon.Dooms@UGent.be, Toon.DePessemier@UGent.be, Luc1.Martens@UGent.be

ABSTRACT
While several approaches to event recommendation already exist, a comparison study including different algorithms remains absent. We have set up an online user-centric evaluation experiment to find a recommendation algorithm that improves user satisfaction for a popular Belgian cultural events website. Both implicit and explicit feedback in the form of user interactions with the website were logged over a period of 41 days, serving as the input for 5 popular recommendation approaches. By means of a questionnaire, users were asked to rate different qualitative aspects of the recommender system including accuracy, novelty, diversity, satisfaction, and trust. Results show that a hybrid of a user-based collaborative filtering and a content-based approach outperforms the other algorithms on almost every qualitative metric. Correlation values between the answers in the questionnaire indicate that accuracy and transparency correlate the most with general user satisfaction with the recommender system.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; H.5.2 [User Interfaces]: User-centered design

General Terms
Algorithms, Experimentation, Human Factors.

Keywords
Recommender systems, events, user-centric evaluation, experiment, correlation, recommendation algorithms.

1. INTRODUCTION
More and more recommender systems are being integrated with web-based platforms that suffer from information overload. By personalizing content based on user preferences, recommender systems assist in selecting relevant items on these websites. In this paper, we focus on evaluating recommendations for a Belgian cultural events website. This website contains the details of more than 30,000 near-future and ongoing cultural activities including movie releases, theater shows, exhibitions, fairs and many others.

In the research domain of recommender systems, numerous studies have focused on recommending movies. Movies have been studied thoroughly and many best practices are known. The area of event recommendation, on the other hand, is relatively new. Events are so-called one-and-only items [5], which makes them harder to recommend. While other types of items generally remain available (and thus recommendable) for longer periods of time, this is not the case for events. They take place at a specific moment in time and place, and become irrelevant very quickly afterwards.

Some approaches towards event recommendation do exist. For the Pittsburgh area, a cultural event recommender was built around trust relations [8]. Friends could be explicitly and implicitly rated for trust, ranging from 'trust strongly' to 'block'. A recommender system for academic events [7] focused more on social network analysis (SNA) in combination with collaborative filtering (CF), and finally Cornelis et al. [3] described a hybrid event recommendation approach where aspects of both CF and content-based algorithms were employed. To our knowledge, however, event recommendation algorithms have never been compared in a user-centric experiment with a focus on optimal user satisfaction.

For a comparison of algorithms, offline metrics like RMSE, MAE or precision and recall are often calculated. These kinds of metrics allow automated and objective comparison of the accuracy of the algorithms, but they alone cannot guarantee user satisfaction in the end [9]. As shown in [2], the use of different offline metrics can even lead to a different outcome for the 'best' algorithm for the job. Hayes et al. [6] state that real user satisfaction can only be measured in an online context. We want to improve the user satisfaction of real-life users of the event website and therefore opt for an online user-centric evaluation of different recommendation algorithms.

2. EXPERIMENT SETUP
To find the recommendation algorithm that results in the highest user satisfaction, we have set up a user-centric evaluation experiment. For a period of 41 days, we monitored both implicit and explicit user feedback in the form of user interactions with the event website. We used the collected feedback as input for 5 different recommendation algorithms, each of which generated a list of recommendations for every user. Bollen et al. [1] hypothesize that a set of somewhere between seven and ten items would be ideal in the sense that it can be quite varied but still manageable for the users. The users therefore received a randomly chosen recommendation list containing 8 events together with an online questionnaire. They were asked to rate different aspects of the quality of their given recommendations. In the following subsections, we elaborate on the specifics of the experiment such as the feedback collection, the recommendation algorithms, how we randomized the users, and the questionnaire.

2.1 Feedback collection
Feedback collection is a very important aspect of the recommendation process. Since the final recommendations can only be as good as the quality of their input, collecting as much high-quality feedback as possible is of paramount importance. Previous feedback experiments we ran on the website [4] showed that collecting explicit feedback (in the form of explicit ratings) is very hard, since users do not rate often. Clicking and browsing through the event information pages, on the other hand, are activities that were abundantly logged. For optimal results, we ultimately combined implicit and explicit user feedback gathered during the run of the experiment.

Since explicit ratings are typically provided only after an event has been visited, collaborative filtering algorithms based on such ratings alone would be useless. It therefore makes sense to also utilize implicit feedback indicators, like printing the event's information, which can be collected before the event has taken place. In total, 11 distinct feedback activities were combined into a feedback value that expresses the interest of a user in a specific event.

The different activities are listed in Table 1 together with their resulting feedback values, which were intuitively determined. The max() function is used to accumulate multiple feedback values in case a user provided feedback in more than one way for the same event.

  Feedback activity             Feedback value
  Click on 'I like this'        1.0
  Share on Facebook/Twitter     0.9
  Click on Itinerary            0.6
  Click on Print                0.6
  Click on 'Go by bus/train'    0.6
  Click on 'Show more details'  0.5
  Click on 'Show more dates'    0.5
  Mail to a friend              0.4
  Browse to an event            0.3

Table 1: The distinct activities that were collected as user feedback together with the feedback value indicating the interest of an individual user in a specific event, ranging from 1.0 (very interested) to 0.3 (slightly interested).
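To make the accumulation rule concrete, the following sketch (our illustration, not the authors' code; the activity labels are hypothetical stand-ins for the logged interactions) maps the activities of Table 1 to their feedback values and keeps the maximum per user-event pair:

```python
# Feedback values from Table 1 (activity -> interest weight).
# Activity labels are hypothetical stand-ins for the logged interactions.
FEEDBACK_VALUES = {
    "like": 1.0,          # Click on 'I like this'
    "share": 0.9,         # Share on Facebook/Twitter
    "itinerary": 0.6,     # Click on Itinerary
    "print": 0.6,         # Click on Print
    "bus_train": 0.6,     # Click on 'Go by bus/train'
    "more_details": 0.5,  # Click on 'Show more details'
    "more_dates": 0.5,    # Click on 'Show more dates'
    "mail_friend": 0.4,   # Mail to a friend
    "browse": 0.3,        # Browse to an event
}

def feedback_value(activities):
    """Combine all logged activities of one user for one event into a
    single interest value, using max() as described in Section 2.1."""
    values = [FEEDBACK_VALUES[a] for a in activities if a in FEEDBACK_VALUES]
    return max(values) if values else 0.0

# A user who browsed to an event and then printed it: max(0.3, 0.6) = 0.6
print(feedback_value(["browse", "print"]))
```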
2.2 Recommendation Algorithms
To assess the influence of the recommendation algorithm on the experience of the end user, 5 different algorithms are used in this experiment. Each user, unaware of the different algorithms, is randomly assigned to one of the 5 groups receiving recommendations generated by one of these algorithms, as described in Section 2.3.

As a baseline suggestion mechanism, the random recommender (RAND), which generates recommendations by performing a random sampling of the available events, is used. The only requirement for these random recommendations is that the event is still available (i.e. it is still possible for the user to attend the event). The evaluation of these random recommendations allows us to investigate whether users can distinguish random events from personalized recommendations and, if so, the relative (accuracy) improvement of more intelligent algorithms over random recommendations.

Because of its widespread use and general applicability, standard collaborative filtering (CF) is chosen as the second algorithm of the experiment. We opted for the user-based nearest neighbor version of the algorithm (UBCF) because of the higher user-user overlap compared to the item-item overlap. Neighbors were defined as users with a minimum overlap of 1 event in their feedback profiles who were at least 5% similar according to the cosine similarity metric.
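As an illustration of this neighbor definition, a minimal sketch assuming feedback profiles are stored as dicts mapping event id to feedback value (not the paper's implementation):

```python
import math

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse feedback profiles
    (dicts mapping event id -> feedback value, as in Table 1)."""
    common = set(profile_a) & set(profile_b)
    if not common:  # neighbors require a minimum overlap of 1 event
        return 0.0
    dot = sum(profile_a[e] * profile_b[e] for e in common)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b)

def neighbors(target, profiles, min_similarity=0.05):
    """Users at least 5% similar to the target user (Section 2.2)."""
    result = {}
    for user, profile in profiles.items():
        if user == target:
            continue
        sim = cosine_similarity(profiles[target], profile)
        if sim >= min_similarity:
            result[user] = sim
    return result

# Two users sharing feedback on event 'e1':
print(neighbors("u1", {"u1": {"e1": 1.0, "e2": 0.5},
                       "u2": {"e1": 0.6, "e3": 0.9}}))
```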
The third algorithm evaluated in this experiment is singular value decomposition (SVD) [11], a well-known matrix factorization technique that addresses the problems of synonymy, polysemy, sparsity, and scalability for large datasets. Based on preceding simulations on an offline dataset with historical data of the website, the parameters of the algorithm were determined: 100 initial steps were used to train the model and the number of features was set at 70.

Considering the transiency of events and the ability of content-based (CB) algorithms to recommend items before they have received any feedback, a CB algorithm was chosen as the fourth algorithm. This algorithm matches the event metadata, which contain the title, the categories, the artist(s), and keywords originating from a textual description of the event, to the personal preferences of the user, which are composed by means of these metadata and the user feedback gathered during the experiment. A weighting value is assigned to the various metadata fields (see Table 2), thereby attaching a relative importance to the fields during the matching process (e.g., a user preference for an artist is more important than a user preference for a keyword of the description). The employed keyword extraction mechanism is based on a term frequency-inverse document frequency (tf-idf) weighting scheme, and includes features such as stemming and stop word filtering.

  Metadata field  Weight
  Artist          1.0
  Category        0.7
  Keyword         0.2

Table 2: The metadata fields used by the content-based recommendation algorithm with their weights indicating their relative importance.

Since pure CB algorithms might produce recommendations with a limited diversity [9], and CF techniques might produce suboptimal results due to a large amount of unrated items (the cold start problem), a hybrid algorithm (CB+UBCF), combining features of both CB and CF techniques, completes the list. This fifth algorithm combines the best personal suggestions produced by the CF with the best suggestions originating from the CB algorithm, thereby generating a merged list of hybrid recommendations for every user. This algorithm acts on the resulting recommendation lists produced by the CF and CB recommenders, and does not change the internal working of these individual algorithms. Both lists are interwoven while alternately switching their order, such that each list has its best recommendation on top in 50% of the cases.

For each algorithm, the final event recommendations are checked for their availability and familiarity to the user. Events that are no longer available for attendance, or events that the user has already explored (by viewing the webpage, or clicking the link), are replaced in the recommendation list.
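Our reading of the interleaving scheme described above can be sketched as follows (illustrative only; duplicate handling and the availability check are omitted):

```python
import random

def hybrid_merge(cb_list, ubcf_list, size=8):
    """Merge the best CB and UBCF recommendations into one list,
    alternating between both lists. Which list leads is randomized,
    so each list is on top in ~50% of the cases (Section 2.2).
    With size=8, the 4 best items of each list are used."""
    if random.random() < 0.5:
        first, second = cb_list, ubcf_list
    else:
        first, second = ubcf_list, cb_list
    merged = []
    for a, b in zip(first, second):
        merged.extend([a, b])
        if len(merged) >= size:
            break
    return merged[:size]

print(hybrid_merge(["cb1", "cb2", "cb3", "cb4"],
                   ["cf1", "cf2", "cf3", "cf4"]))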
2.3 Randomizing Users
Since certain users provided only a limited amount of feedback during the experiment, not all recommendation algorithms were able to generate personal suggestions for these users. CF algorithms, for instance, can only identify neighbors for users who have overlapping feedback with other users (i.e. provided feedback on the same event as another user). Without these neighbors, CF algorithms are not able to produce recommendations. Therefore, users with a limited profile, which hinders (some of) the algorithms from generating (enough) recommendations for that user, are treated separately in the analysis. Many of these users are not very active on the website or did not finish the evaluation procedure described in Section 2.4. This group of cold-start users received recommendations from a randomly assigned algorithm that was able to generate recommendations for that user based on the limited profile. Since the random recommender can produce suggestions even without user feedback, at least 1 algorithm was able to generate a recommendation list for every user. The comparative evaluation of the 5 algorithms, however, is based on the remaining users. Each of these users is randomly assigned to 1 of the 5 algorithms, which generates personal suggestions for that user. This way, the 5 algorithms, as described in Section 2.2, are evaluated by a number of randomly selected users.

2.4 Evaluation Procedure
While prediction accuracy of ratings used to be the only evaluation criterion for recommender systems, in recent years optimizing the user experience has gained increasing interest in the evaluation procedure. Existing research has proposed a set of criteria detailing the characteristics that constitute a satisfying and effective recommender system from the user's point of view. To combine these criteria into a more comprehensive model which can be used to evaluate the perceived qualities of recommender systems, Pu et al. have developed an evaluation framework for recommender systems [10]. This framework aims to assess the perceived qualities of recommenders such as their usefulness, usability, interface and interaction qualities, and user satisfaction, as well as the influence of these qualities on users' behavioral intentions, including their intention to tell their friends about the system, to purchase the products recommended to them, and to return to the system in the future. We therefore adopted (part of) this framework to measure users' subjective attitudes, based on their experience with the event recommender and the various algorithms tested during our experiment. Via an online questionnaire, test users were asked to answer 14 questions on a 5-point Likert scale from "strongly disagree" (1) to "strongly agree" (5) regarding aspects such as recommendation accuracy, novelty, diversity, satisfaction and trust of the system. We selected the following 8 most relevant questions for this research regarding various aspects of the event recommendation system.

Q1 The items recommended to me matched my interests.
Q2 Some of the recommended items are familiar to me.
Q4 The recommender system helps me discover new products.
Q5 The items recommended to me are similar to each other (reverse scale).
Q7 I didn't understand why the items were recommended to me (reverse scale).
Q8 Overall, I am satisfied with the recommender.
Q10 The recommender can be trusted.
Q13 I would attend some of the events recommended, given the opportunity.
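Note that Q5 and Q7 are reverse-scale items; the paper reports them on their original scale (hence, for example, the negative Q7 correlations in Table 5 below). If the scales ever need to be aligned across questions, the standard recode on a 5-point Likert scale is 6 - score; a one-line sketch added here purely for illustration:

```python
def recode_reverse(score, scale_max=5):
    """Recode a reverse-scale Likert answer (e.g. Q5, Q7) so that a
    higher value always means 'better': 1 <-> 5, 2 <-> 4, 3 stays 3."""
    return scale_max + 1 - score

assert [recode_reverse(s) for s in [1, 2, 3, 4, 5]] == [5, 4, 3, 2, 1]
```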
3. RESULTS
We allowed all users of the event website to participate in our experiment and encouraged them to do so by means of e-mail and a banner on the site. In total, 612 users responded positively to our request. After a period of feedback logging, as described in Section 2.1, they were randomly distributed across the 5 recommendation algorithms, which calculated for each of them a list of 8 recommendations. After the recommendations were made available on the website, users were asked by mail to fill out the accompanying online questionnaire as described in Section 2.4.

Of the 612 users who were interested in the experiment, 232 actually completed the online questionnaire regarding their recommendations. After removal of fake samples (i.e., users who answered every question with the same value) and users with incomplete (feedback) profiles, 193 users remained. They had on average 22 consumptions (i.e., expressed feedback values for events) and 84% of them had 5 or more consumptions. The final distribution of the users across the algorithms is displayed in Table 3.

  Algorithm  #Users
  CB         43
  CB+UBCF    36
  RAND       45
  SVD        36
  UBCF       33

Table 3: The 5 algorithms compared in this experiment and the number of users that actually completed the questionnaire about their recommendation lists.

Figure 1 shows the averaged results of the answers provided by the 193 users in this experiment for the 8 questions we described in Section 2.4 and for each algorithm.

[Figure 1: The averaged result of the answers (5-point Likert scale from "strongly disagree" (1) to "strongly agree" (5)) of the evaluation questionnaire for each algorithm and questions Q1, Q2, Q4, Q5, Q7, Q8, Q10 and Q13. The error bars indicate the 95% confidence interval. Note that questions Q5 and Q7 were in reverse scale.]

Evaluating the answers to the questionnaire showed that the hybrid recommender (CB+UBCF) achieved the best averaged results on all questions, except for question Q5, which asked the user to evaluate the similarity of the recommendations (i.e. diversity). For question Q5, the random recommender obtained the best results in terms of diversity, since random suggestions are rarely similar to each other. The CF algorithm was the runner-up in the evaluation and achieved second place after the hybrid recommender for almost all questions (again except for Q5, where CF was fourth after the random recommender, the hybrid recommender and SVD).

The success of the hybrid recommender is not only clearly visible when comparing the average scores for each question (Figure 1); it also proved statistically significantly better than every other algorithm (except for the CF recommender) according to a Wilcoxon rank test (p < 0.05) for the majority of the questions (Q1, Q2, Q8, Q10 and Q13). Table 4 shows the algorithms and questions for which statistically significant differences could be noted according to this non-parametric statistical hypothesis test.

  Algorithm pair     Significantly different questions
  CB vs CB+UBCF      Q1, Q2, Q5, Q8, Q10, Q13
  CB vs RAND         Q2, Q5
  CB vs SVD          Q1, Q5, Q7, Q8
  CB vs UBCF         Q2, Q5, Q10
  CB+UBCF vs RAND    Q1, Q2, Q4, Q5, Q7, Q8, Q10, Q13
  CB+UBCF vs SVD     Q1, Q2, Q7, Q8, Q10, Q13
  CB+UBCF vs UBCF    Q13
  RAND vs SVD        Q5
  RAND vs UBCF       Q2, Q5, Q10
  SVD vs UBCF        Q1, Q2, Q7, Q8, Q10

Table 4: The statistically significant differences between the algorithms on all the questions, using the Wilcoxon rank test at a confidence level of 0.95. The original matrix is symmetric, so each algorithm pair is listed once.
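For independent groups of users, a "Wilcoxon rank test" corresponds to the Wilcoxon rank-sum (Mann-Whitney) test. A sketch of how such a pairwise significance matrix could be computed with SciPy (the answer data below is fabricated purely for illustration, not taken from the study):

```python
from itertools import combinations
from scipy.stats import ranksums

def significant_differences(answers, alpha=0.05):
    """answers: dict mapping algorithm name -> list of Likert answers
    (1-5) to one question. Returns the algorithm pairs whose answer
    distributions differ significantly (Wilcoxon rank-sum test)."""
    significant = []
    for algo_a, algo_b in combinations(answers, 2):
        stat, p = ranksums(answers[algo_a], answers[algo_b])
        if p < alpha:
            significant.append((algo_a, algo_b, round(p, 4)))
    return significant

# Illustrative (fabricated) answers to Q8 for two algorithms:
print(significant_differences({
    "CB+UBCF": [4, 5, 4, 4, 5, 3, 4, 5, 4, 4],
    "RAND":    [2, 3, 1, 2, 3, 2, 4, 1, 2, 3],
}))
```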
The average performance of SVD was somewhat disappointing: it achieved the worst results for questions Q1, Q7, and Q8, and the second worst results (after the random recommender) for questions Q2, Q4, Q10, Q11, and Q13. So, surprisingly, the SVD algorithm performs (on average) worse than the random method on some fundamental questions, for example Q8, which addresses the general user satisfaction. We note, however, that the difference in values between SVD and the RAND algorithm was not found to be statistically significant except for question Q5.

We looked more closely into this observation and plotted a histogram (Figure 2) of the different values (1 to 5) of the answers provided for question Q8. A clear distinction between the histogram of the SVD algorithm and the histograms of the other algorithms (CB and RAND shown in the figure) can be seen. Whereas for CB and RAND most values are grouped towards one side of the histogram (i.e. the higher values), this is not the case for SVD. It turns out that the opinions about the general satisfaction of the SVD algorithm were somewhat divided between good and bad with no apparent winning answer. These noteworthy rating values for the SVD recommender are not only visible in the results of Q8, but also for other questions like Q2 and Q5. These findings indicate that SVD works well for many users, but also provides inaccurate recommendations for a considerable number of other users. These inaccurate recommendations may be due to a limited amount of user feedback and therefore sketchy user profiles.

[Figure 2: The histogram of the values (1 to 5) that were given to question Q8 for algorithm CB (left), RAND (middle) and SVD (right).]

Figure 1 seems to indicate that some of the answers to the questions are highly correlated. One clear example is question Q1, about whether or not the recommended items matched the user's interests, and question Q8, which asked about the general user satisfaction. As obvious as this correlation may be, other correlated questions may not be so easy to detect by inspecting a graph with averaged results, and so we calculated the complete correlation matrix for every question over all the algorithms using the two-tailed Pearson correlation metric (Table 5).

                      Q1      Q2      Q4      Q5      Q7      Q8      Q10     Q13
  Q1 (accuracy)       1       .431    .459    .012    -.731   .767    .783    .718
  Q2 (familiarity)    .431    1       .227    .036    -.405   .387    .429    .415
  Q4 (novelty)        .459    .227    1       -.037   -.424   .496    .516    .542
  Q5 (diversity)      .012    .036    -.037   1       .016    -.008   .001    -.096
  Q7 (transparency)   -.731   -.405   -.424   .016    1       -.722   -.707   -.622
  Q8 (satisfaction)   .767    .387    .496    -.008   -.722   1       .829    .712
  Q10 (trust)         .783    .429    .516    .001    -.707   .829    1       .725
  Q13 (usefulness)    .718    .415    .542    -.096   -.622   .712    .725    1

Table 5: The complete correlation matrix for the answers to the 8 most relevant questions on the online questionnaire. The applied metric is the Pearson correlation, so values are distributed between -1.0 (negatively correlated) and 1.0 (positively correlated). Note that the matrix is symmetric and that questions Q5 and Q7 were in reverse scale.
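A correlation matrix like Table 5 can be computed with pandas and SciPy; a minimal sketch assuming the questionnaire answers are arranged as one row per user and one column per question (the data and column layout are hypothetical):

```python
import pandas as pd
from scipy.stats import pearsonr

# One row per user, one column per question (hypothetical values).
answers = pd.DataFrame({
    "Q1": [4, 5, 3, 2, 4],
    "Q5": [3, 2, 4, 3, 2],
    "Q8": [4, 5, 3, 2, 5],
})

corr = answers.corr(method="pearson")          # matrix as in Table 5
r, p = pearsonr(answers["Q1"], answers["Q8"])  # two-tailed p-value for one pair
print(corr, f"r(Q1,Q8)={r:.3f}, p={p:.3f}", sep="\n")
```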
From the correlation values, two similar trends can be noticed for questions Q8 and Q10, dealing with the user satisfaction and the trust in the system respectively. The answers to these questions are highly correlated (very significant, p < 0.01) with almost every other question except for Q5 (diversity). We must be careful not to confuse correlation with causality, but the data still indicate a strong relation between user satisfaction on the one hand and recommendation accuracy and transparency on the other.

This strong relation may be another reason why SVD performed very badly in the experiment. Its inner workings are the most obscure and least obvious to the user, and therefore also the least transparent.

Another interesting observation lies in the correlation values of question Q5. The answers to this diversity question are almost completely unrelated to every other question (i.e., low correlation values which are not significant, p > 0.05). It seems that the users of the experiment did not value the diversity of a recommendation list as much as the other aspects of the recommendation system. If we look at the average results (Figure 1) of the diversity question (lower is more diverse), we can see this idea confirmed. The ordering of how diverse the recommendation lists produced by the algorithms were is in no way reflected in the general user satisfaction or trust in the system.

To gain some deeper insight into the influence of the qualitative attributes on each other, we performed a simple linear regression analysis. By trying to predict an attribute using all the other ones as input to the regression function, a hint of causality may be revealed. As regression method we used multiple stepwise regression: a combination of the forward and backward selection approach, which step by step tries to add new variables to its model (or remove existing ones) that have the highest marginal relative influence on the dependent variable. The following lines express the regression results. We indicate what attributes were added to the model by means of an arrow notation. Between brackets we also indicate the coefficient of determination R². This coefficient indicates what percentage of the variance in the dependent variable can be explained by the model. R² will be 1 for a perfect fit and 0 when no linear relationship could be found.

Q1 ← Q7, Q8, Q10, Q13 (R² = 0.7131)
Q2 ← Q7, Q10, Q13 (R² = 0.2195)
Q4 ← Q10, Q13 (R² = 0.326)
Q5 ← Q1, Q13 (R² = 0.02295)
Q7 ← Q1, Q2, Q8, Q10 (R² = 0.6095)
Q8 ← Q1, Q7, Q10, Q13 (R² = 0.747)
Q10 ← Q1, Q2, Q4, Q7, Q8, Q13 (R² = 0.7625)
Q13 ← Q1, Q2, Q4, Q5, Q8, Q10 (R² = 0.6395)

The most interesting regression result is the line where Q8 (satisfaction) is predicted by Q1, Q7, Q10 and Q13. This result further strengthens our belief that accuracy (Q1) and transparency (Q7) are the main influencers of user satisfaction in our experiment (we consider Q10 and Q13 to be results of satisfaction rather than real influencers, but they are of course also connected).
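The paper used a combined forward/backward stepwise procedure; the sketch below shows only the simpler forward half (greedily adding the predictor that raises R² the most), as a minimal illustration of how such models can be selected:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    return 1 - residuals.var() / y.var()

def forward_stepwise(data, target, min_gain=0.01):
    """Greedy forward selection: repeatedly add the predictor that
    raises R^2 the most, until the gain drops below min_gain.
    data: dict mapping question -> numpy array of answers."""
    selected, best_r2 = [], 0.0
    candidates = [q for q in data if q != target]
    y = data[target]
    while candidates:
        gains = {q: r_squared(np.column_stack([data[c] for c in selected + [q]]), y)
                 for q in candidates}
        q, r2 = max(gains.items(), key=lambda kv: kv[1])
        if r2 - best_r2 < min_gain:
            break
        selected.append(q)
        candidates.remove(q)
        best_r2 = r2
    return selected, best_r2

# Example: predict Q8 from the other questions (data arrays hypothetical):
# selected, r2 = forward_stepwise(answers, target="Q8")
```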
Table 6 shows the coverage of the algorithms in terms of the number of users they were able to produce recommendations for. In our experiment we noticed an average coverage of 66%, excluding the random recommender.

  Algorithm  Coverage (%)
  CB         69%
  CB+UBCF    66%
  RAND       100%
  SVD        66%
  UBCF       65%

Table 6: The 5 algorithms compared in this experiment and their coverage in terms of the number of users for which they were able to generate a recommendation list of minimum 8 items.

Next to this online and user-centric experiment, we also ran some offline tests and compared them to the real opinions of the users. We calculated the recommendations on a training set that randomly contained 80% of the feedback collected in the experiment. Using the leftover 20% as the test set, the accuracy of every algorithm was calculated over all users in terms of precision, recall and F1-measure (Table 7). This procedure was repeated 10 times to average out any random effects.

  Algorithm  Precision (%)  Recall (%)  F1 (%)
  CB         0.462          2.109       0.758
  CB+UBCF    1.173          4.377       1.850
  RAND       0.003          0.015       0.005
  SVD        0.573          2.272       0.915
  UBCF       1.359          4.817       2.119

Table 7: The accuracy of the recommendation algorithms in terms of precision, recall and F1-measure based on an offline analysis.

By comparing the offline and online results in our experiment, we noticed a small change in the ranking of the algorithms. In terms of precision, the UBCF approach came out best, followed by respectively CB+UBCF, SVD, CB and RAND. While the hybrid approach performed best in the online analysis, this is not the case for the offline tests. Note also that SVD and CB have swapped places in the ranking: SVD proved slightly better at predicting user behaviour than the CB algorithm. A possible explanation (for the inverse online results) is that users in the online test may have valued the transparency of the CB algorithm over its (objective) accuracy. Our offline evaluation test further underlines the shortcomings of these procedures. In our experiment we had over 30,000 items that were available for recommendation and on average only 22 consumptions per user. The extremely low precision and recall values are the result of this extreme sparsity problem.

It would have been interesting to correlate the accuracy values obtained by the offline analysis with the subjective accuracy values provided by the users. The experiments, however, showed very fluctuating results, with on the one hand users with close to zero precision and on the other hand some users with relatively high precision values. These results could therefore not be properly matched against the results gathered online.
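A sketch of this repeated hold-out protocol (random 80/20 split, 10 repetitions, top-8 lists); `recommend` is a hypothetical stand-in for any of the five algorithms, not an interface from the paper:

```python
import random

def precision_recall_f1(recommended, relevant):
    """Top-N accuracy of one recommendation list."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

def offline_evaluation(feedback, recommend, repetitions=10, n=8):
    """Average top-n precision/recall/F1 over repeated random 80/20
    train/test splits (Section 3). `feedback` is a list of
    (user, event) pairs; `recommend(train, user, n)` is a hypothetical
    stand-in for one of the 5 algorithms, returning n event ids."""
    scores = []
    for _ in range(repetitions):
        shuffled = random.sample(feedback, len(feedback))
        cut = int(0.8 * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        for user in {u for u, _ in test}:
            relevant = [e for u, e in test if u == user]
            scores.append(precision_recall_f1(recommend(train, user, n), relevant))
    return [sum(metric) / len(scores) for metric in zip(*scores)]  # [P, R, F1]
```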
4. DISCUSSION
The results clearly indicate the hybrid recommendation algorithm (CB+UBCF) as the overall best algorithm for optimizing the user satisfaction in our event recommendation system. The runner-up for this position would definitely be the UBCF algorithm, followed by the CB algorithm. This comes as no surprise considering that the hybrid algorithm is merely a combination of these UBCF and CB algorithms. Since the UBCF algorithm is second best, it looks like this algorithm is the most responsible for the success of the hybrid. While the weights of both algorithms were equal in this experiment (i.e., the 4 best recommendations of each list were selected to be combined in the hybrid list), it would be interesting to see how the results evolve if these weights were tuned more in favour of the CF approach (e.g., 5 × UBCF + 3 × CB).

Because we collected both implicit and explicit feedback to serve as input for the recommendation algorithms, there were no restrictions as to what algorithms we were able to use. Implicit feedback that was logged before an event took place allowed the use of CF algorithms, and the availability of item metadata enabled content-based approaches. Only in this ideal situation can a hybrid CB+UBCF algorithm serve an event recommendation system.

The slightly changed coverage is another issue that may come up when a hybrid algorithm like this is deployed. While the separate CB and UBCF algorithms had coverages of 69% and 65% respectively, the hybrid combination served 66% of the users. We can explain this increase of 1% over the UBCF by noting that the hybrid algorithm requires a minimum of only 4 recommendations (versus 8 normally) to be able to provide the users with a recommendation list.

5. CONCLUSIONS
For a Belgian cultural events website, we wanted to find a recommendation algorithm that improves the user experience in terms of user satisfaction and trust. Since offline evaluation metrics are inadequate for this task, we set up an online and user-centric evaluation experiment with 5 popular and common recommendation algorithms, i.e. CB, CB+UBCF, RAND, SVD and UBCF. We logged both implicit and explicit feedback data in the form of weighted user interactions with the event website over a period of 41 days. We extracted the users for which every algorithm was able to generate at least 8 recommendations and presented each of these users with a recommendation list randomly chosen from one of the 5 recommendation algorithms. Users were asked to fill out an online questionnaire that addressed qualitative aspects of their recommendation lists including accuracy, novelty, diversity, satisfaction, and trust.

Results clearly showed that the CB+UBCF algorithm, which combines the recommendations of both CB and UBCF, outperforms every other algorithm (or is equally good, in the case of question Q2 and the UBCF algorithm) except for the diversity aspect. In terms of diversity, the random recommendations turned out best, which of course makes perfect sense. Inspection of the correlation values between the answers to the questions revealed, however, that diversity is in no way correlated with user satisfaction, trust or, for that matter, any other qualitative aspect we investigated. The recommendation accuracy and transparency, on the other hand, were the two qualitative aspects most highly correlated with user satisfaction, and they showed to be promising predictors in the regression analysis.

The SVD algorithm came out last in the ranking of the algorithms and was statistically even indistinguishable from the random recommender for most of the questions, except again for the diversity question (Q5). A histogram of the values for SVD on question Q8 puts this into context by revealing an almost black-and-white opinion pattern expressed by the users in the experiment.

6. FUTURE WORK
While we were able to investigate numerous different qualitative aspects of each algorithm individually, the experiment did not allow us, apart from indicating a best and worst algorithm, to construct an overall ranking of the recommendation algorithms, since each user ended up evaluating just one algorithm. As future work, we intend to extend this experiment with a focus group, allowing us to elaborate on the reasoning behind some of the answers users provided and to compare subjective rankings of the algorithms. We also plan to extend our regression analysis to come up with a causal path model that will allow us to better understand how the different algorithms influence the overall satisfaction.

7. ACKNOWLEDGMENTS
The research activities described in this paper were funded by a PhD grant to Simon Dooms from the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT Vlaanderen) and a PhD grant to Toon De Pessemier from the Fund for Scientific Research-Flanders (FWO Vlaanderen). We would like to thank CultuurNet Vlaanderen for the effort and support they were willing to provide for deploying the experiment described in this paper.

8. REFERENCES
[1] D. Bollen, B. Knijnenburg, M. Willemsen, and M. Graus. Understanding choice overload in recommender systems. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 63-70. ACM, 2010.
[2] E. Campochiaro, R. Casatta, P. Cremonesi, and R. Turrin. Do metrics make recommender algorithms? In Proceedings of the 2009 International Conference on Advanced Information Networking and Applications Workshops, WAINA '09, pages 648-653, Washington, DC, USA, 2009. IEEE Computer Society.
[3] C. Cornelis, X. Guo, J. Lu, and G. Zhang. A fuzzy relational approach to event recommendation. In Proceedings of the Indian International Conference on Artificial Intelligence, 2005.
[4] S. Dooms, T. De Pessemier, and L. Martens. An online evaluation of explicit feedback mechanisms for recommender systems. In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST), 2011.
[5] X. Guo, G. Zhang, E. Chew, and S. Burdon. A hybrid recommendation approach for one-and-only items. AI 2005: Advances in Artificial Intelligence, pages 457-466, 2005.
[6] C. Hayes, P. Massa, P. Avesani, and P. Cunningham. An on-line evaluation framework for recommender systems. In Workshop on Personalization and Recommendation in E-Commerce. Citeseer, 2002.
[7] R. Klamma, P. Cuong, and Y. Cao. You never walk alone: Recommending academic events based on social network analysis. Complex Sciences, pages 657-670, 2009.
[8] D. Lee. PittCult: trust-based cultural event recommender. In Proceedings of the 2008 ACM Conference on Recommender Systems, pages 311-314. ACM, 2008.
[9] S. McNee, J. Riedl, and J. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI '06 Extended Abstracts on Human Factors in Computing Systems, page 1101. ACM, 2006.
[10] P. Pu and L. Chen. A user-centric evaluation framework of recommender systems. In Proc. ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces (UCERSTI), 2010.
[11] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender system - a case study. Technical report, University of Minnesota, Department of Computer Science, 2000.