=Paper=
{{Paper
|id=None
|storemode=property
|title=A User-centric Evaluation of Recommender Algorithms for an Event Recommendation System
|pdfUrl=https://ceur-ws.org/Vol-811/paper10.pdf
|volume=Vol-811
}}
==A User-centric Evaluation of Recommender Algorithms for an Event Recommendation System==
Simon Dooms, Toon De Pessemier, Luc Martens
Wica-INTEC, IBBT-Ghent University, G. Crommenlaan 8 box 201, B-9050 Ghent, Belgium
Simon.Dooms@UGent.be, Toon.DePessemier@UGent.be, Luc1.Martens@UGent.be

ABSTRACT
While several approaches to event recommendation already exist, a comparison study including different algorithms remains absent. We have set up an online user-centric evaluation experiment to find a recommendation algorithm that improves user satisfaction for a popular Belgian cultural events website. Both implicit and explicit feedback in the form of user interactions with the website were logged over a period of 41 days, serving as the input for 5 popular recommendation approaches. By means of a questionnaire, users were asked to rate different qualitative aspects of the recommender system including accuracy, novelty, diversity, satisfaction, and trust. Results show that a hybrid of a user-based collaborative filtering and a content-based approach outperforms the other algorithms on almost every qualitative metric. Correlation values between the answers in the questionnaire indicate that accuracy and transparency correlate the most with general user satisfaction with the recommender system.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; H.5.2 [User Interfaces]: User-centered design

General Terms
Algorithms, Experimentation, Human Factors.

Keywords
Recommender systems, events, user-centric evaluation, experiment, correlation, recommendation algorithms.

1. INTRODUCTION
More and more recommender systems are being integrated with web-based platforms that suffer from information overload. By personalizing content based on user preferences, recommender systems assist in selecting relevant items on these websites. In this paper, we focus on evaluating recommendations for a Belgian cultural events website. This website contains the details of more than 30,000 near-future and ongoing cultural activities including movie releases, theater shows, exhibitions, fairs and many others.

In the research domain of recommender systems, numerous studies have focused on recommending movies. Movies have been studied thoroughly and many best practices are known. The area of event recommendation, on the other hand, is relatively new. Events are so-called one-and-only items [5], which makes them harder to recommend. While other types of items generally remain available (and thus recommendable) for longer periods of time, this is not the case for events. They take place at a specific moment in time and place, and become irrelevant very quickly afterwards.

Some approaches towards event recommendation do exist. For the Pittsburgh area, a cultural event recommender was built around trust relations [8]. Friends could be explicitly and implicitly rated for trust, ranging from 'trust strongly' to 'block'. A recommender system for academic events [7] focused more on social network analysis (SNA) in combination with collaborative filtering (CF), and finally Cornelis et al. [3] described a hybrid event recommendation approach where aspects of both CF and content-based algorithms were employed. To our knowledge, however, event recommendation algorithms have never been compared in a user-centric experiment with a focus on optimal user satisfaction.

For a comparison of algorithms, offline metrics like RMSE, MAE or precision and recall are often calculated. These kinds of metrics allow automated and objective comparison of the accuracy of the algorithms, but they alone cannot guarantee user satisfaction in the end [9]. As shown in [2], the use of different offline metrics can even lead to a different outcome for the 'best' algorithm for the job. Hayes et al. [6] state that real user satisfaction can only be measured in an online context. We want to improve the user satisfaction of real-life users of the event website and therefore opt for an online user-centric evaluation of different recommendation algorithms.

2. EXPERIMENT SETUP
To find the recommendation algorithm that results in the highest user satisfaction, we have set up a user-centric evaluation experiment. For a period of 41 days, we monitored both implicit and explicit user feedback in the form of user interactions with the event website. We used the collected feedback as input for 5 different recommendation algorithms, each of which generated a list of recommendations for every user. Bollen et al. [1] hypothesize that a set of somewhere between seven and ten items would be ideal in the sense that it can be quite varied but still manageable for the users. The users therefore received a randomly chosen recommendation list containing 8 events together with an online questionnaire. They were asked to rate different aspects of the quality of their given recommendations. In the following subsections, we elaborate on the specifics of the experiment such as the feedback collection, the recommendation algorithms, how we randomized the users, and the questionnaire.

2.1 Feedback collection
Feedback collection is a very important aspect of the recommendation process. Since the final recommendations can only be as good as the quality of their input, collecting as much high-quality feedback as possible is of paramount importance. Previous feedback experiments we ran on the website [4] showed that collecting explicit feedback (in the form of explicit ratings) is very hard, since users do not rate often. Clicking and browsing through the event information pages, on the other hand, are activities that were abundantly logged. For optimal results, we ultimately combined implicit and explicit user feedback gathered during the run of the experiment.

Since explicit ratings are typically provided only after an event has been visited, collaborative filtering algorithms based on such ratings alone would be useless. It therefore makes sense to also utilize implicit feedback indicators, like printing the event's information, which can be collected before the event has taken place. In total, 11 distinct feedback activities were combined into a feedback value that expresses the interest of a user in a specific event.

The different activities are listed in Table 1 together with their resulting feedback values, which were intuitively determined. The max() function is used to accumulate multiple feedback values in case a user provided feedback in more than one way for the same event.

  Feedback activity             Feedback value
  Click on 'I like this'        1.0
  Share on Facebook/Twitter     0.9
  Click on Itinerary            0.6
  Click on Print                0.6
  Click on 'Go by bus/train'    0.6
  Click on 'Show more details'  0.5
  Click on 'Show more dates'    0.5
  Mail to a friend              0.4
  Browse to an event            0.3

Table 1: The distinct activities that were collected as user feedback together with the feedback value indicating the interest of an individual user in a specific event, ranging from 1.0 (very interested) to 0.3 (slightly interested).
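To make the accumulation rule concrete, the following sketch (our illustration, not the authors' code; the activity labels are hypothetical stand-ins for the logged interactions) maps the activities of Table 1 to their feedback values and keeps the maximum per user-event pair:

```python
# Feedback values from Table 1 (activity -> interest weight).
# Activity labels are hypothetical stand-ins for the logged interactions.
FEEDBACK_VALUES = {
    "like": 1.0,          # Click on 'I like this'
    "share": 0.9,         # Share on Facebook/Twitter
    "itinerary": 0.6,     # Click on Itinerary
    "print": 0.6,         # Click on Print
    "bus_train": 0.6,     # Click on 'Go by bus/train'
    "more_details": 0.5,  # Click on 'Show more details'
    "more_dates": 0.5,    # Click on 'Show more dates'
    "mail_friend": 0.4,   # Mail to a friend
    "browse": 0.3,        # Browse to an event
}

def feedback_value(activities):
    """Combine all logged activities of one user for one event into a
    single interest value, using max() as described in Section 2.1."""
    values = [FEEDBACK_VALUES[a] for a in activities if a in FEEDBACK_VALUES]
    return max(values) if values else 0.0

# A user who browsed to an event and then printed it: max(0.3, 0.6) = 0.6
print(feedback_value(["browse", "print"]))
```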
2.2 Recommendation Algorithms
To assess the influence of the recommendation algorithm on the experience of the end user, 5 different algorithms are used in this experiment. Each user, unaware of the different algorithms, is randomly assigned to one of the 5 groups receiving recommendations generated by one of these algorithms, as described in Section 2.3.

As a baseline suggestion mechanism, the random recommender (RAND), which generates recommendations by performing a random sampling of the available events, is used. The only requirement for these random recommendations is that the event is still available (i.e. it is still possible for the user to attend the event). The evaluation of these random recommendations allows us to investigate whether users can distinguish random events from personalized recommendations and, if so, the relative (accuracy) improvement of more intelligent algorithms over random recommendations.

Because of its widespread use and general applicability, standard collaborative filtering (CF) is chosen as the second algorithm of the experiment. We opted for the user-based nearest neighbor version of the algorithm (UBCF) because of the higher user-user overlap compared to the item-item overlap. Neighbors were defined as users with a minimum overlap of 1 event in their feedback profiles who were at least 5% similar according to the cosine similarity metric.
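As an illustration of this neighbor definition, a minimal sketch assuming feedback profiles are stored as dicts mapping event id to feedback value (not the paper's implementation):

```python
import math

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse feedback profiles
    (dicts mapping event id -> feedback value, as in Table 1)."""
    common = set(profile_a) & set(profile_b)
    if not common:  # neighbors require a minimum overlap of 1 event
        return 0.0
    dot = sum(profile_a[e] * profile_b[e] for e in common)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b)

def neighbors(target, profiles, min_similarity=0.05):
    """Users at least 5% similar to the target user (Section 2.2)."""
    result = {}
    for user, profile in profiles.items():
        if user == target:
            continue
        sim = cosine_similarity(profiles[target], profile)
        if sim >= min_similarity:
            result[user] = sim
    return result

# Two users sharing feedback on event 'e1':
print(neighbors("u1", {"u1": {"e1": 1.0, "e2": 0.5},
                       "u2": {"e1": 0.6, "e3": 0.9}}))
```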
The third algorithm evaluated in this experiment is singular value decomposition (SVD) [11], a well-known matrix factorization technique that addresses the problems of synonymy, polysemy, sparsity, and scalability for large datasets. Based on preceding simulations on an offline dataset with historical data of the website, the parameters of the algorithm were determined: 100 initial steps were used to train the model and the number of features was set at 70.

Considering the transiency of events and the ability of content-based (CB) algorithms to recommend items before they have received any feedback, a CB algorithm was chosen as the fourth algorithm. This algorithm matches the event metadata, which contain the title, the categories, the artist(s), and keywords originating from a textual description of the event, to the personal preferences of the user, which are composed by means of these metadata and the user feedback gathered during the experiment. A weighting value is assigned to the various metadata fields (see Table 2), thereby attaching a relative importance to the fields during the matching process (e.g., a user preference for an artist is more important than a user preference for a keyword of the description). The employed keyword extraction mechanism is based on a term frequency-inverse document frequency (tf-idf) weighting scheme, and includes features such as stemming and stop word filtering.

  Metadata field  Weight
  Artist          1.0
  Category        0.7
  Keyword         0.2

Table 2: The metadata fields used by the content-based recommendation algorithm with their weights indicating their relative importance.

Since pure CB algorithms might produce recommendations with a limited diversity [9], and CF techniques might produce suboptimal results due to a large amount of unrated items (the cold start problem), a hybrid algorithm (CB+UBCF), combining features of both CB and CF techniques, completes the list. This fifth algorithm combines the best personal suggestions produced by the CF with the best suggestions originating from the CB algorithm, thereby generating a merged list of hybrid recommendations for every user. This algorithm acts on the resulting recommendation lists produced by the CF and CB recommenders, and does not change the internal working of these individual algorithms. Both lists are interwoven while alternately switching their order, such that each list has its best recommendation on top in 50% of the cases.

For each algorithm, the final event recommendations are checked for their availability and familiarity to the user. Events that are no longer available for attendance, or events that the user has already explored (by viewing the webpage, or clicking the link), are replaced in the recommendation list.
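Our reading of the interleaving scheme described above can be sketched as follows (illustrative only; duplicate handling and the availability check are omitted):

```python
import random

def hybrid_merge(cb_list, ubcf_list, size=8):
    """Merge the best CB and UBCF recommendations into one list,
    alternating between both lists. Which list leads is randomized,
    so each list is on top in ~50% of the cases (Section 2.2).
    With size=8, the 4 best items of each list are used."""
    if random.random() < 0.5:
        first, second = cb_list, ubcf_list
    else:
        first, second = ubcf_list, cb_list
    merged = []
    for a, b in zip(first, second):
        merged.extend([a, b])
        if len(merged) >= size:
            break
    return merged[:size]

print(hybrid_merge(["cb1", "cb2", "cb3", "cb4"],
                   ["cf1", "cf2", "cf3", "cf4"]))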
2.3 Randomizing Users
Since certain users provided only a limited amount of feedback during the experiment, not all recommendation algorithms were able to generate personal suggestions for these users. CF algorithms, for instance, can only identify neighbors for users who have overlapping feedback with other users (i.e. provided feedback on the same event as another user). Without these neighbors, CF algorithms are not able to produce recommendations. Therefore, users with a limited profile, which hinders (some of) the algorithms from generating (enough) recommendations for that user, are treated separately in the analysis. Many of these users are not very active on the website or did not finish the evaluation procedure described in Section 2.4. This group of cold-start users received recommendations from a randomly assigned algorithm that was able to generate recommendations for that user based on the limited profile. Since the random recommender can produce suggestions even without user feedback, at least 1 algorithm was able to generate a recommendation list for every user. The comparative evaluation of the 5 algorithms, however, is based on the remaining users. Each of these users is randomly assigned to 1 of the 5 algorithms, which generates personal suggestions for that user. This way, the 5 algorithms, as described in Section 2.2, are evaluated by a number of randomly selected users.

2.4 Evaluation Procedure
While prediction accuracy of ratings used to be the only evaluation criterion for recommender systems, in recent years optimizing the user experience has gained increasing interest in the evaluation procedure. Existing research has proposed a set of criteria detailing the characteristics that constitute a satisfying and effective recommender system from the user's point of view. To combine these criteria into a more comprehensive model which can be used to evaluate the perceived qualities of recommender systems, Pu et al. have developed an evaluation framework for recommender systems [10]. This framework aims to assess the perceived qualities of recommenders such as their usefulness, usability, interface and interaction qualities, and user satisfaction, as well as the influence of these qualities on users' behavioral intentions, including their intention to tell their friends about the system, to purchase the products recommended to them, and to return to the system in the future. We therefore adopted (part of) this framework to measure users' subjective attitudes, based on their experience with the event recommender and the various algorithms tested during our experiment. Via an online questionnaire, test users were asked to answer 14 questions on a 5-point Likert scale from "strongly disagree" (1) to "strongly agree" (5) regarding aspects such as recommendation accuracy, novelty, diversity, satisfaction and trust of the system. We selected the following 8 most relevant questions for this research regarding various aspects of the event recommendation system.

Q1 The items recommended to me matched my interests.
Q2 Some of the recommended items are familiar to me.
Q4 The recommender system helps me discover new products.
Q5 The items recommended to me are similar to each other (reverse scale).
Q7 I didn't understand why the items were recommended to me (reverse scale).
Q8 Overall, I am satisfied with the recommender.
Q10 The recommender can be trusted.
Q13 I would attend some of the events recommended, given the opportunity.
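Note that Q5 and Q7 are reverse-scale items; the paper reports them on their original scale (hence, for example, the negative Q7 correlations in Table 5 below). If the scales ever need to be aligned across questions, the standard recode on a 5-point Likert scale is 6 - score; a one-line sketch added here purely for illustration:

```python
def recode_reverse(score, scale_max=5):
    """Recode a reverse-scale Likert answer (e.g. Q5, Q7) so that a
    higher value always means 'better': 1 <-> 5, 2 <-> 4, 3 stays 3."""
    return scale_max + 1 - score

assert [recode_reverse(s) for s in [1, 2, 3, 4, 5]] == [5, 4, 3, 2, 1]
```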
3. RESULTS
We allowed all users of the event website to participate in our experiment and encouraged them to do so by means of e-mail and a banner on the site. In total, 612 users responded positively to our request. After a period of feedback logging, as described in Section 2.1, they were randomly distributed across the 5 recommendation algorithms, which calculated for each of them a list of 8 recommendations. After the recommendations were made available on the website, users were asked by mail to fill out the accompanying online questionnaire as described in Section 2.4.

Of the 612 users who were interested in the experiment, 232 actually completed the online questionnaire regarding their recommendations. After removal of fake samples (i.e., users who answered every question with the same value) and users with incomplete (feedback) profiles, 193 users remained. They had on average 22 consumptions (i.e., expressed feedback values for events) and 84% of them had 5 or more consumptions. The final distribution of the users across the algorithms is displayed in Table 3.

  Algorithm  #Users
  CB         43
  CB+UBCF    36
  RAND       45
  SVD        36
  UBCF       33

Table 3: The 5 algorithms compared in this experiment and the number of users that actually completed the questionnaire about their recommendation lists.

Figure 1 shows the averaged results of the answers provided by the 193 users in this experiment for the 8 questions we described in Section 2.4 and for each algorithm.

[Figure 1: The averaged result of the answers (5-point Likert scale from "strongly disagree" (1) to "strongly agree" (5)) of the evaluation questionnaire for each algorithm and questions Q1, Q2, Q4, Q5, Q7, Q8, Q10 and Q13. The error bars indicate the 95% confidence interval. Note that questions Q5 and Q7 were in reverse scale.]

Evaluating the answers to the questionnaire showed that the hybrid recommender (CB+UBCF) achieved the best averaged results on all questions, except for question Q5, which asked the user to evaluate the similarity of the recommendations (i.e. diversity). For question Q5, the random recommender obtained the best results in terms of diversity, since random suggestions are rarely similar to each other. The CF algorithm was the runner-up in the evaluation and achieved second place after the hybrid recommender for almost all questions (again except for Q5, where CF was fourth after the random recommender, the hybrid recommender and SVD).

The success of the hybrid recommender is not only clearly visible when comparing the average scores for each question (Figure 1); it also proved statistically significantly better than every other algorithm (except for the CF recommender) according to a Wilcoxon rank test (p < 0.05) for the majority of the questions (Q1, Q2, Q8, Q10 and Q13). Table 4 shows the algorithms and questions for which statistically significant differences could be noted according to this non-parametric statistical hypothesis test.

  Algorithm pair     Significantly different questions
  CB vs CB+UBCF      Q1, Q2, Q5, Q8, Q10, Q13
  CB vs RAND         Q2, Q5
  CB vs SVD          Q1, Q5, Q7, Q8
  CB vs UBCF         Q2, Q5, Q10
  CB+UBCF vs RAND    Q1, Q2, Q4, Q5, Q7, Q8, Q10, Q13
  CB+UBCF vs SVD     Q1, Q2, Q7, Q8, Q10, Q13
  CB+UBCF vs UBCF    Q13
  RAND vs SVD        Q5
  RAND vs UBCF       Q2, Q5, Q10
  SVD vs UBCF        Q1, Q2, Q7, Q8, Q10

Table 4: The statistically significant differences between the algorithms on all the questions, using the Wilcoxon rank test at a confidence level of 0.95. The original matrix is symmetric, so each algorithm pair is listed once.
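For independent groups of users, a "Wilcoxon rank test" corresponds to the Wilcoxon rank-sum (Mann-Whitney) test. A sketch of how such a pairwise significance matrix could be computed with SciPy (the answer data below is fabricated purely for illustration, not taken from the study):

```python
from itertools import combinations
from scipy.stats import ranksums

def significant_differences(answers, alpha=0.05):
    """answers: dict mapping algorithm name -> list of Likert answers
    (1-5) to one question. Returns the algorithm pairs whose answer
    distributions differ significantly (Wilcoxon rank-sum test)."""
    significant = []
    for algo_a, algo_b in combinations(answers, 2):
        stat, p = ranksums(answers[algo_a], answers[algo_b])
        if p < alpha:
            significant.append((algo_a, algo_b, round(p, 4)))
    return significant

# Illustrative (fabricated) answers to Q8 for two algorithms:
print(significant_differences({
    "CB+UBCF": [4, 5, 4, 4, 5, 3, 4, 5, 4, 4],
    "RAND":    [2, 3, 1, 2, 3, 2, 4, 1, 2, 3],
}))
```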
The average performance of SVD was somewhat disappointing: it achieved the worst results for questions Q1, Q7, and Q8, and the second worst results (after the random recommender) for questions Q2, Q4, Q10, Q11, and Q13. So, surprisingly, the SVD algorithm performs (on average) worse than the random method on some fundamental questions, for example Q8, which addresses the general user satisfaction. We note, however, that the difference in values between SVD and the RAND algorithm was not found to be statistically significant except for question Q5.

We looked more closely into this observation and plotted a histogram (Figure 2) of the different values (1 to 5) of the answers provided for question Q8. A clear distinction between the histogram of the SVD algorithm and the histograms of the other algorithms (CB and RAND shown in the figure) can be seen. Whereas for CB and RAND most values are grouped towards one side of the histogram (i.e. the higher values), this is not the case for SVD. It turns out that the opinions about the general satisfaction of the SVD algorithm were somewhat divided between good and bad with no apparent winning answer. These noteworthy rating values for the SVD recommender are not only visible in the results of Q8, but also for other questions like Q2 and Q5. These findings indicate that SVD works well for many users, but also provides inaccurate recommendations for a considerable number of other users. These inaccurate recommendations may be due to a limited amount of user feedback and therefore sketchy user profiles.

[Figure 2: The histogram of the values (1 to 5) that were given to question Q8 for algorithm CB (left), RAND (middle) and SVD (right).]

Figure 1 seems to indicate that some of the answers to the questions are highly correlated. One clear example is question Q1, about whether or not the recommended items matched the user's interests, and question Q8, which asked about the general user satisfaction. As obvious as this correlation may be, other correlated questions may not be so easy to detect by inspecting a graph with averaged results, and so we calculated the complete correlation matrix for every question over all the algorithms using the two-tailed Pearson correlation metric (Table 5).

                      Q1      Q2      Q4      Q5      Q7      Q8      Q10     Q13
  Q1 (accuracy)       1       .431    .459    .012    -.731   .767    .783    .718
  Q2 (familiarity)    .431    1       .227    .036    -.405   .387    .429    .415
  Q4 (novelty)        .459    .227    1       -.037   -.424   .496    .516    .542
  Q5 (diversity)      .012    .036    -.037   1       .016    -.008   .001    -.096
  Q7 (transparency)   -.731   -.405   -.424   .016    1       -.722   -.707   -.622
  Q8 (satisfaction)   .767    .387    .496    -.008   -.722   1       .829    .712
  Q10 (trust)         .783    .429    .516    .001    -.707   .829    1       .725
  Q13 (usefulness)    .718    .415    .542    -.096   -.622   .712    .725    1

Table 5: The complete correlation matrix for the answers to the 8 most relevant questions on the online questionnaire. The applied metric is the Pearson correlation, so values are distributed between -1.0 (negatively correlated) and 1.0 (positively correlated). Note that the matrix is symmetric and that questions Q5 and Q7 were in reverse scale.
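A correlation matrix like Table 5 can be computed with pandas and SciPy; a minimal sketch assuming the questionnaire answers are arranged as one row per user and one column per question (the data and column layout are hypothetical):

```python
import pandas as pd
from scipy.stats import pearsonr

# One row per user, one column per question (hypothetical values).
answers = pd.DataFrame({
    "Q1": [4, 5, 3, 2, 4],
    "Q5": [3, 2, 4, 3, 2],
    "Q8": [4, 5, 3, 2, 5],
})

corr = answers.corr(method="pearson")          # matrix as in Table 5
r, p = pearsonr(answers["Q1"], answers["Q8"])  # two-tailed p-value for one pair
print(corr, f"r(Q1,Q8)={r:.3f}, p={p:.3f}", sep="\n")
```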
From the correlation values, two similar trends can be noticed for questions Q8 and Q10, dealing with the user satisfaction and the trust in the system respectively. The answers to these questions are highly correlated (very significant, p < 0.01) with almost every other question except for Q5 (diversity). We must be careful not to confuse correlation with causality, but the data still indicate a strong relation between user satisfaction on the one hand and recommendation accuracy and transparency on the other.

This strong relation may be another reason why SVD performed very badly in the experiment. Its inner workings are the most obscure and least obvious to the user, and therefore also the least transparent.

Another interesting observation lies in the correlation values of question Q5. The answers to this diversity question are almost completely unrelated to every other question (i.e., low correlation values which are not significant, p > 0.05). It seems that the users of the experiment did not value the diversity of a recommendation list as much as the other aspects of the recommendation system. If we look at the average results (Figure 1) of the diversity question (lower is more diverse), we can see this idea confirmed. The ordering of how diverse the recommendation lists produced by the algorithms were is in no way reflected in the general user satisfaction or trust in the system.

To gain some deeper insight into the influence of the qualitative attributes on each other, we performed a simple linear regression analysis. By trying to predict an attribute using all the other ones as input to the regression function, a hint of causality may be revealed. As regression method we used multiple stepwise regression: a combination of the forward and backward selection approach, which step by step tries to add new variables to its model (or remove existing ones) that have the highest marginal relative influence on the dependent variable. The following lines express the regression results. We indicate what attributes were added to the model by means of an arrow notation. Between brackets we also indicate the coefficient of determination R². This coefficient indicates what percentage of the variance in the dependent variable can be explained by the model. R² will be 1 for a perfect fit and 0 when no linear relationship could be found.

Q1 ← Q7, Q8, Q10, Q13 (R² = 0.7131)
Q2 ← Q7, Q10, Q13 (R² = 0.2195)
Q4 ← Q10, Q13 (R² = 0.326)
Q5 ← Q1, Q13 (R² = 0.02295)
Q7 ← Q1, Q2, Q8, Q10 (R² = 0.6095)
Q8 ← Q1, Q7, Q10, Q13 (R² = 0.747)
Q10 ← Q1, Q2, Q4, Q7, Q8, Q13 (R² = 0.7625)
Q13 ← Q1, Q2, Q4, Q5, Q8, Q10 (R² = 0.6395)

The most interesting regression result is the line where Q8 (satisfaction) is predicted by Q1, Q7, Q10 and Q13. This result further strengthens our belief that accuracy (Q1) and transparency (Q7) are the main influencers of user satisfaction in our experiment (we consider Q10 and Q13 to be results of satisfaction rather than real influencers, but they are of course also connected).
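The paper used a combined forward/backward stepwise procedure; the sketch below shows only the simpler forward half (greedily adding the predictor that raises R² the most), as a minimal illustration of how such models can be selected:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    return 1 - residuals.var() / y.var()

def forward_stepwise(data, target, min_gain=0.01):
    """Greedy forward selection: repeatedly add the predictor that
    raises R^2 the most, until the gain drops below min_gain.
    data: dict mapping question -> numpy array of answers."""
    selected, best_r2 = [], 0.0
    candidates = [q for q in data if q != target]
    y = data[target]
    while candidates:
        gains = {q: r_squared(np.column_stack([data[c] for c in selected + [q]]), y)
                 for q in candidates}
        q, r2 = max(gains.items(), key=lambda kv: kv[1])
        if r2 - best_r2 < min_gain:
            break
        selected.append(q)
        candidates.remove(q)
        best_r2 = r2
    return selected, best_r2

# Example: predict Q8 from the other questions (data arrays hypothetical):
# selected, r2 = forward_stepwise(answers, target="Q8")
```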
Table 6 shows the coverage of the algorithms in terms of the number of users they were able to produce recommendations for. In our experiment we noticed an average coverage of 66%, excluding the random recommender.

  Algorithm  Coverage (%)
  CB         69%
  CB+UBCF    66%
  RAND       100%
  SVD        66%
  UBCF       65%

Table 6: The 5 algorithms compared in this experiment and their coverage in terms of the number of users for which they were able to generate a recommendation list of minimum 8 items.

Next to this online and user-centric experiment, we also ran some offline tests and compared them to the real opinions of the users. We calculated the recommendations on a training set that randomly contained 80% of the feedback collected in the experiment. Using the leftover 20% as the test set, the accuracy of every algorithm was calculated over all users in terms of precision, recall and F1-measure (Table 7). This procedure was repeated 10 times to average out any random effects.

  Algorithm  Precision (%)  Recall (%)  F1 (%)
  CB         0.462          2.109       0.758
  CB+UBCF    1.173          4.377       1.850
  RAND       0.003          0.015       0.005
  SVD        0.573          2.272       0.915
  UBCF       1.359          4.817       2.119

Table 7: The accuracy of the recommendation algorithms in terms of precision, recall and F1-measure based on an offline analysis.

By comparing the offline and online results in our experiment, we noticed a small change in the ranking of the algorithms. In terms of precision, the UBCF approach came out best, followed by respectively CB+UBCF, SVD, CB and RAND. While the hybrid approach performed best in the online analysis, this is not the case for the offline tests. Note also that SVD and CB have swapped places in the ranking: SVD proved slightly better at predicting user behaviour than the CB algorithm. A possible explanation (for the inverse online results) is that users in the online test may have valued the transparency of the CB algorithm over its (objective) accuracy. Our offline evaluation test further underlines the shortcomings of these procedures. In our experiment we had over 30,000 items that were available for recommendation and on average only 22 consumptions per user. The extremely low precision and recall values are the result of this extreme sparsity problem.

It would have been interesting to correlate the accuracy values obtained by the offline analysis with the subjective accuracy values provided by the users. The experiments, however, showed very fluctuating results, with on the one hand users with close to zero precision and on the other hand some users with relatively high precision values. These results could therefore not be properly matched against the results gathered online.
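A sketch of this repeated hold-out protocol (random 80/20 split, 10 repetitions, top-8 lists); `recommend` is a hypothetical stand-in for any of the five algorithms, not an interface from the paper:

```python
import random

def precision_recall_f1(recommended, relevant):
    """Top-N accuracy of one recommendation list."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

def offline_evaluation(feedback, recommend, repetitions=10, n=8):
    """Average top-n precision/recall/F1 over repeated random 80/20
    train/test splits (Section 3). `feedback` is a list of
    (user, event) pairs; `recommend(train, user, n)` is a hypothetical
    stand-in for one of the 5 algorithms, returning n event ids."""
    scores = []
    for _ in range(repetitions):
        shuffled = random.sample(feedback, len(feedback))
        cut = int(0.8 * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        for user in {u for u, _ in test}:
            relevant = [e for u, e in test if u == user]
            scores.append(precision_recall_f1(recommend(train, user, n), relevant))
    return [sum(metric) / len(scores) for metric in zip(*scores)]  # [P, R, F1]
```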
4. DISCUSSION
The results clearly indicate the hybrid recommendation algorithm (CB+UBCF) as the overall best algorithm for optimizing the user satisfaction in our event recommendation system. The runner-up for this position would definitely be the UBCF algorithm, followed by the CB algorithm. This comes as no surprise considering that the hybrid algorithm is merely a combination of these UBCF and CB algorithms. Since the UBCF algorithm is second best, it looks like this algorithm is the most responsible for the success of the hybrid. While the weights of both algorithms were equal in this experiment (i.e., the 4 best recommendations of each list were selected to be combined in the hybrid list), it would be interesting to see how the results evolve if these weights were tuned more in favour of the CF approach (e.g., 5 × UBCF + 3 × CB).

Because we collected both implicit and explicit feedback to serve as input for the recommendation algorithms, there were no restrictions as to what algorithms we were able to use. Implicit feedback that was logged before an event took place allowed the use of CF algorithms, and the availability of item metadata enabled content-based approaches. Only in this ideal situation can a hybrid CB+UBCF algorithm serve an event recommendation system.

The slightly changed coverage is another issue that may come up when a hybrid algorithm like this is deployed. While the separate CB and UBCF algorithms had coverages of 69% and 65% respectively, the hybrid combination served 66% of the users. We can explain this increase of 1% over the UBCF by noting that the hybrid algorithm requires a minimum of only 4 recommendations (versus 8 normally) to be able to provide the users with a recommendation list.

5. CONCLUSIONS
For a Belgian cultural events website, we wanted to find a recommendation algorithm that improves the user experience in terms of user satisfaction and trust. Since offline evaluation metrics are inadequate for this task, we set up an online and user-centric evaluation experiment with 5 popular and common recommendation algorithms, i.e. CB, CB+UBCF, RAND, SVD and UBCF. We logged both implicit and explicit feedback data in the form of weighted user interactions with the event website over a period of 41 days. We extracted the users for which every algorithm was able to generate at least 8 recommendations and presented each of these users with a recommendation list randomly chosen from one of the 5 recommendation algorithms. Users were asked to fill out an online questionnaire that addressed qualitative aspects of their recommendation lists including accuracy, novelty, diversity, satisfaction, and trust.

Results clearly showed that the CB+UBCF algorithm, which combines the recommendations of both CB and UBCF, outperforms every other algorithm (or is equally good, in the case of question Q2 and the UBCF algorithm) except for the diversity aspect. In terms of diversity, the random recommendations turned out best, which of course makes perfect sense. Inspection of the correlation values between the answers to the questions revealed, however, that diversity is in no way correlated with user satisfaction, trust or, for that matter, any other qualitative aspect we investigated. The recommendation accuracy and transparency, on the other hand, were the two qualitative aspects most highly correlated with user satisfaction, and they showed to be promising predictors in the regression analysis.

The SVD algorithm came out last in the ranking of the algorithms and was statistically even indistinguishable from the random recommender for most of the questions, except again for the diversity question (Q5). A histogram of the values for SVD on question Q8 puts this into context by revealing an almost black-and-white opinion pattern expressed by the users in the experiment.

6. FUTURE WORK
While we were able to investigate numerous different qualitative aspects of each algorithm individually, the experiment did not allow us, apart from indicating a best and worst algorithm, to construct an overall ranking of the recommendation algorithms, since each user ended up evaluating just one algorithm. As future work, we intend to extend this experiment with a focus group, allowing us to elaborate on the reasoning behind some of the answers users provided and to compare subjective rankings of the algorithms. We also plan to extend our regression analysis to come up with a causal path model that will allow us to better understand how the different algorithms influence the overall satisfaction.

7. ACKNOWLEDGMENTS
The research activities described in this paper were funded by a PhD grant to Simon Dooms from the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT Vlaanderen) and a PhD grant to Toon De Pessemier from the Fund for Scientific Research-Flanders (FWO Vlaanderen). We would like to thank CultuurNet Vlaanderen for the effort and support they were willing to provide for deploying the experiment described in this paper.

8. REFERENCES
[1] D. Bollen, B. Knijnenburg, M. Willemsen, and M. Graus. Understanding choice overload in recommender systems. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 63-70. ACM, 2010.
[2] E. Campochiaro, R. Casatta, P. Cremonesi, and R. Turrin. Do metrics make recommender algorithms? In Proceedings of the 2009 International Conference on Advanced Information Networking and Applications Workshops, WAINA '09, pages 648-653, Washington, DC, USA, 2009. IEEE Computer Society.
[3] C. Cornelis, X. Guo, J. Lu, and G. Zhang. A fuzzy relational approach to event recommendation. In Proceedings of the Indian International Conference on Artificial Intelligence, 2005.
[4] S. Dooms, T. De Pessemier, and L. Martens. An online evaluation of explicit feedback mechanisms for recommender systems. In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST), 2011.
[5] X. Guo, G. Zhang, E. Chew, and S. Burdon. A hybrid recommendation approach for one-and-only items. AI 2005: Advances in Artificial Intelligence, pages 457-466, 2005.
[6] C. Hayes, P. Massa, P. Avesani, and P. Cunningham. An on-line evaluation framework for recommender systems. In Workshop on Personalization and Recommendation in E-Commerce. Citeseer, 2002.
[7] R. Klamma, P. Cuong, and Y. Cao. You never walk alone: Recommending academic events based on social network analysis. Complex Sciences, pages 657-670, 2009.
[8] D. Lee. PittCult: trust-based cultural event recommender. In Proceedings of the 2008 ACM Conference on Recommender Systems, pages 311-314. ACM, 2008.
[9] S. McNee, J. Riedl, and J. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI '06 Extended Abstracts on Human Factors in Computing Systems, page 1101. ACM, 2006.
[10] P. Pu and L. Chen. A user-centric evaluation framework of recommender systems. In Proc. ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces (UCERSTI), 2010.
[11] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender system - a case study. Technical report, University of Minnesota, Department of Computer Science, 2000.