A User-centric Evaluation of Recommender Algorithms for
an Event Recommendation System

Simon Dooms, Toon De Pessemier, Luc Martens
Wica-INTEC, IBBT-Ghent University
G. Crommenlaan 8 box 201, B-9050 Ghent, Belgium
Simon.Dooms@UGent.be, Toon.DePessemier@UGent.be, Luc1.Martens@UGent.be

ABSTRACT
While several approaches to event recommendation already exist, a comparison study including different algorithms remains absent. We have set up an online, user-centric evaluation experiment to find a recommendation algorithm that improves user satisfaction for a popular Belgian cultural events website. Both implicit and explicit feedback in the form of user interactions with the website were logged over a period of 41 days, serving as the input for 5 popular recommendation approaches. By means of a questionnaire, users were asked to rate different qualitative aspects of the recommender system including accuracy, novelty, diversity, satisfaction, and trust.
   Results show that a hybrid of a user-based collaborative filtering and a content-based approach outperforms the other algorithms on almost every qualitative metric. Correlation values between the answers in the questionnaire indicate that accuracy and transparency are the aspects most strongly correlated with general user satisfaction with the recommender system.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; H.5.2 [User Interfaces]: User-centered design

General Terms
Algorithms, Experimentation, Human Factors.

Keywords
Recommender systems, events, user-centric evaluation, experiment, correlation, recommendation algorithms.

1.   INTRODUCTION
   More and more recommender systems are being integrated with web-based platforms that suffer from information overload. By personalizing content based on user preferences, recommender systems assist in selecting relevant items on these websites. In this paper, we focus on evaluating recommendations for a Belgian cultural events website. This website contains the details of more than 30,000 near-future and ongoing cultural activities including movie releases, theater shows, exhibitions, fairs and many others.
   In the research domain of recommender systems, numerous studies have focused on recommending movies. They have been studied thoroughly and many best practices are known. The area of event recommendation, on the other hand, is relatively new. Events are so-called one-and-only items [5], which makes them harder to recommend. While other types of items generally remain available (and thus recommendable) for longer periods of time, this is not the case for events. They take place at a specific moment in time and place and become irrelevant very quickly afterwards.
   Some approaches towards event recommendation do exist. For the Pittsburgh area, a cultural event recommender was built around trust relations [8]. Friends could be explicitly and implicitly rated for trust, ranging from ‘trust strongly’ to ‘block’. A recommender system for academic events [7] focused more on social network analysis (SNA) in combination with collaborative filtering (CF), and finally Cornelis et al. [3] described a hybrid event recommendation approach in which aspects of both CF and content-based algorithms were employed. To our knowledge, however, event recommendation algorithms have never been compared in a user-centric experiment with a focus on optimal user satisfaction.
   For a comparison of algorithms, offline metrics like RMSE, MAE or precision and recall are often calculated. These kinds of metrics allow automated and objective comparison of the accuracy of the algorithms, but they alone cannot guarantee user satisfaction in the end [9]. As shown in [2], the use of different offline metrics can even lead to a different outcome for the ‘best’ algorithm for the job. Hayes et al. [6] state that real user satisfaction can only be measured in an online context. We want to improve the user satisfaction for real-life users of the event website and are therefore opting for an online, user-centric evaluation of different recommendation algorithms.

2.   EXPERIMENT SETUP
   To find the recommendation algorithm that results in the highest user satisfaction, we have set up a user-centric evaluation experiment. For a period of 41 days, we monitored both implicit and explicit user feedback in the form of user interactions with the event website. We used the collected feedback as input for 5 different recommendation algorithms, each of which generated a list of recommendations for every user. Bollen et al. [1] hypothesize that a set of somewhere between seven and ten items would be ideal in the sense that it can be quite varied but still manageable for the users. The users therefore received a randomly chosen recommendation list containing 8 events together with an online questionnaire. They were asked to rate different aspects of the quality of their given recommendations.
   In the following subsections, we elaborate on the specifics of the experiment: the feedback collection, the recommendation algorithms, how we randomized the users, and the questionnaire.




2.1    Feedback collection
   Feedback collection is a very important aspect of the recommendation process. Since the final recommendations can only be as good as the quality of their input, collecting as much high-quality feedback as possible is of paramount importance. Previous feedback experiments we ran on the website [4] showed that collecting explicit feedback (in the form of explicit ratings) is very hard, since users do not rate often. Clicking and browsing through the event information pages, on the other hand, are activities that were abundantly logged. For optimal results, we ultimately combined implicit and explicit user feedback gathered during the run of the experiment.
   Since explicit ratings are typically provided after an event has been visited, algorithms based on collaborative filtering would be useless. It therefore makes sense to also utilize implicit feedback indicators like printing the event’s information, which can be collected before the event has taken place. In total, 11 distinct feedback activities were combined into a feedback value that expresses the interest of a user in a specific event.
   The different activities are listed in Table 1 together with their resulting feedback values, which were intuitively determined. The max() function is used to accumulate multiple feedback values in case a user provided feedback in more than one way for the same event.

       Feedback activity               Feedback value
       Click on ‘I like this’               1.0
       Share on Facebook/Twitter            0.9
       Click on Itinerary                   0.6
       Click on Print                       0.6
       Click on ‘Go by bus/train’           0.6
       Click on ‘Show more details’         0.5
       Click on ‘Show more dates’           0.5
       Mail to a friend                     0.4
       Browse to an event                   0.3

Table 1: The distinct activities that were collected as user feedback together with the feedback value indicating the interest of an individual user in a specific event, ranging from 1.0 (very interested) to 0.3 (slightly interested).
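The paper gives no implementation of this aggregation rule; the sketch below merely illustrates one possible reading of it in Python, using the activity weights of Table 1 and hypothetical activity labels.

```python
# Sketch of the feedback aggregation of Section 2.1 (illustrative only).
# Weights are taken from Table 1; the activity labels are hypothetical.
FEEDBACK_WEIGHTS = {
    "like": 1.0,
    "share": 0.9,
    "itinerary": 0.6,
    "print": 0.6,
    "go_by_bus_train": 0.6,
    "show_more_details": 0.5,
    "show_more_dates": 0.5,
    "mail_to_friend": 0.4,
    "browse_event": 0.3,
}

def feedback_value(activities):
    """Combine all logged activities of one user for one event into a single
    interest value by taking the maximum of the individual activity weights."""
    values = [FEEDBACK_WEIGHTS[a] for a in activities if a in FEEDBACK_WEIGHTS]
    return max(values) if values else 0.0

# A user who browsed an event page and later shared it ends up with 0.9.
print(feedback_value(["browse_event", "share"]))
```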
2.2    Recommendation Algorithms
   To assess the influence of the recommendation algorithm on the experience of the end-user, 5 different algorithms are used in this experiment. Each user, unaware of the different algorithms, is randomly assigned to one of the 5 groups receiving recommendations generated by one of these algorithms, as described in Section 2.3.
   As a baseline suggestion mechanism, the random recommender (RAND), which generates recommendations by performing a random sampling of the available events, is used. The only requirement of these random recommendations is that the event is still available (i.e., it is still possible for the user to attend the event). The evaluation of these random recommendations allows us to investigate whether users can distinguish random events from personalized recommendations and, if so, the relative (accuracy) improvement of more intelligent algorithms over random recommendations.
   Because of its widespread use and general applicability, standard collaborative filtering (CF) is chosen as the second algorithm of the experiment. We opted for the user-based nearest neighbor version of the algorithm (UBCF) because of the higher user-user overlap compared to the item-item overlap. Neighbors were defined as users with a minimum overlap of 1 event in their feedback profiles who were also at least 5% similar according to the cosine similarity metric.
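A minimal sketch of this neighbor selection (at least one overlapping event and at least 5% cosine similarity) could look as follows; the feedback profiles are hypothetical and this is not the deployed implementation.

```python
import math

# Sketch of the UBCF neighbor selection described above (hypothetical profiles).
# A profile maps event ids to the aggregated feedback values of Section 2.1.

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse feedback profiles."""
    common = set(profile_a) & set(profile_b)
    if not common:
        return 0.0
    dot = sum(profile_a[e] * profile_b[e] for e in common)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b)

def neighbors(target, others, min_overlap=1, min_similarity=0.05):
    """Users with at least `min_overlap` common events and 5% cosine similarity."""
    result = []
    for user_id, profile in others.items():
        overlap = len(set(target) & set(profile))
        sim = cosine_similarity(target, profile)
        if overlap >= min_overlap and sim >= min_similarity:
            result.append((user_id, sim))
    return sorted(result, key=lambda pair: pair[1], reverse=True)

# Example with hypothetical users.
alice = {"event1": 1.0, "event2": 0.5}
others = {"bob": {"event2": 0.9, "event3": 0.3}, "carol": {"event4": 0.6}}
print(neighbors(alice, others))  # carol has no overlapping event and is excluded
```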
   The third algorithm evaluated in this experiment is singular value decomposition (SVD) [11], a well-known matrix factorization technique that addresses the problems of synonymy, polysemy, sparsity, and scalability for large datasets. Based on preceding simulations on an offline dataset with historical data of the website, the parameters of the algorithm were determined: 100 initial steps were used to train the model and the number of features was set at 70.
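The exact training procedure of the SVD recommender is not described in the paper; the sketch below therefore shows a generic stochastic-gradient matrix factorization, with the number of latent features and training steps as parameters, applied to toy data.

```python
import random

# Generic SGD matrix factorization sketch (not the exact SVD variant of [11]).
# `ratings` holds (user, event, feedback value) triples built as in Section 2.1.
def factorize(ratings, users, items, n_features=70, n_steps=100, lr=0.05, reg=0.02):
    random.seed(0)
    P = {u: [random.uniform(-0.1, 0.1) for _ in range(n_features)] for u in users}
    Q = {i: [random.uniform(-0.1, 0.1) for _ in range(n_features)] for i in items}
    for _ in range(n_steps):
        for user, item, value in ratings:
            err = value - sum(pu * qi for pu, qi in zip(P[user], Q[item]))
            for f in range(n_features):
                pu, qi = P[user][f], Q[item][f]
                P[user][f] += lr * (err * qi - reg * pu)
                Q[item][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, user, item):
    return sum(pu * qi for pu, qi in zip(P[user], Q[item]))

# Toy example; the prediction for an observed pair moves towards its feedback value.
data = [("u1", "e1", 1.0), ("u1", "e2", 0.5), ("u2", "e1", 0.9)]
P, Q = factorize(data, {"u1", "u2"}, {"e1", "e2"}, n_features=4, n_steps=500)
print(round(predict(P, Q, "u2", "e1"), 2))
```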
   Considering the transiency of events and the ability of content-based (CB) algorithms to recommend items before they have received any feedback, a CB algorithm was chosen as the fourth algorithm. This algorithm matches the event metadata, which contain the title, the categories, the artist(s), and keywords originating from a textual description of the event, to the personal preferences of the user, which are composed by means of these metadata and the user feedback gathered during the experiment. A weighting value is assigned to the various metadata fields (see Table 2), thereby attaching a relative importance to the fields during the matching process (e.g., a user preference for an artist is more important than a user preference for a keyword of the description). The employed keyword extraction mechanism is based on a term frequency-inverse document frequency (tf-idf) weighting scheme and includes features such as stemming and stop-word filtering.

       Metadata field     Weight
       Artist              1.0
       Category            0.7
       Keyword             0.2

Table 2: The metadata fields used by the content-based recommendation algorithm with their weights indicating their relative importance.
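As an illustration of the weighted metadata matching, consider the following sketch; the profile construction and the tf-idf keyword extraction are omitted, the data structures are hypothetical, and only the field weights of Table 2 are taken from the paper.

```python
# Sketch of the weighted content-based matching of Section 2.2 (illustrative only).
FIELD_WEIGHTS = {"artist": 1.0, "category": 0.7, "keyword": 0.2}  # Table 2

def cb_score(user_profile, event_metadata):
    """Score an event against a user profile.

    user_profile: {field: {value: preference}} built from the user's feedback.
    event_metadata: {field: set of values} (artists, categories, tf-idf keywords).
    """
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        preferences = user_profile.get(field, {})
        for value in event_metadata.get(field, set()):
            score += weight * preferences.get(value, 0.0)
    return score

def recommend(user_profile, events, n=8):
    """Return the n best-matching events for a user."""
    ranked = sorted(events.items(),
                    key=lambda kv: cb_score(user_profile, kv[1]), reverse=True)
    return [event_id for event_id, _ in ranked[:n]]

profile = {"artist": {"Arctic Monkeys": 1.0}, "category": {"concert": 0.6}}
events = {"e1": {"artist": {"Arctic Monkeys"}, "category": {"concert"}},
          "e2": {"category": {"exhibition"}, "keyword": {"modern", "art"}}}
print(recommend(profile, events, n=1))  # -> ['e1']
```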
   Since pure CB algorithms might produce recommendations with a limited diversity [9], and CF techniques might produce suboptimal results due to a large amount of unrated items (cold-start problem), a hybrid algorithm (CB+UBCF), combining features of both CB and CF techniques, completes the list. This fifth algorithm combines the best personal suggestions produced by the CF algorithm with the best suggestions originating from the CB algorithm, thereby generating a merged list of hybrid recommendations for every user. This algorithm acts on the resulting recommendation lists produced by the CF and CB recommenders and does not change the internal working of these individual algorithms. Both lists are interwoven while alternately switching their order, such that each list has its best recommendation on top in 50% of the cases.
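The merging step can be illustrated with a short sketch. The coin flip reflects the 50% alternation of the top item described above; duplicate handling and other details are assumptions, not the deployed implementation.

```python
import random

# Sketch of the CB+UBCF list interleaving described above (illustrative only).
def interleave(cf_list, cb_list, n=8):
    """Alternately merge two ranked lists; a coin flip decides which list
    supplies the first (top) recommendation, and duplicates are skipped."""
    first, second = (cf_list, cb_list) if random.random() < 0.5 else (cb_list, cf_list)
    merged = []
    for a, b in zip(first, second):
        for candidate in (a, b):
            if candidate not in merged:
                merged.append(candidate)
    return merged[:n]

cf = ["e1", "e2", "e3", "e4"]
cb = ["e9", "e2", "e7", "e5"]
print(interleave(cf, cb, n=8))
```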




   For each algorithm, the final event recommendations are checked for their availability and familiarity to the user. Events that are no longer available for attendance, or events that the user has already explored (by viewing the webpage or clicking the link), are replaced in the recommendation list.
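This post-processing step amounts to a simple filter; in the sketch below the availability and "already explored" predicates are stand-ins for the website's actual data.

```python
# Sketch of the final filtering step: drop events that are no longer available
# or already explored by the user, and back-fill from ranked replacement candidates.
def finalize(recommendations, candidates, is_available, already_explored, n=8):
    kept = [e for e in recommendations if is_available(e) and not already_explored(e)]
    for event in candidates:
        if len(kept) >= n:
            break
        if event not in kept and is_available(event) and not already_explored(event):
            kept.append(event)
    return kept[:n]

seen = {"e2"}
print(finalize(["e1", "e2"], ["e3", "e4"],
               lambda e: e != "e4", lambda e: e in seen, n=3))  # -> ['e1', 'e3']
```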
2.3   Randomizing Users
   Since certain users provided only a limited amount of feedback during the experiment, not all recommendation algorithms were able to generate personal suggestions for these users. CF algorithms, for instance, can only identify neighbors for users who have overlapping feedback with other users (i.e., provided feedback on the same event as another user). Without these neighbors, CF algorithms are not able to produce recommendations. Therefore, users with a limited profile, hindering (some of) the algorithms from generating (enough) recommendations for that user, are treated separately in the analysis. Many of these users are not very active on the website or did not finish the evaluation procedure as described in Section 2.4. This group of cold-start users received recommendations from a randomly assigned algorithm that was able to generate recommendations based on the limited profile. Since the random recommender can produce suggestions even without user feedback, at least 1 algorithm was able to generate a recommendation list for every user. The comparative evaluation of the 5 algorithms, however, is based on the remaining users. Each of these users is randomly assigned to 1 of the 5 algorithms, which generates personal suggestions for that user. This way, the 5 algorithms, as described in Section 2.2, are evaluated by a number of randomly selected users.
2.4   Evaluation Procedure
   While prediction accuracy of ratings used to be the only evaluation criterion for recommender systems, in recent years optimizing the user experience has gained increasing interest in the evaluation procedure. Existing research has proposed a set of criteria detailing the characteristics that constitute a satisfying and effective recommender system from the user’s point of view. To combine these criteria into a more comprehensive model which can be used to evaluate the perceived qualities of recommender systems, Pu et al. have developed an evaluation framework for recommender systems [10]. This framework aims to assess the perceived qualities of recommenders such as their usefulness, usability, interface and interaction qualities, and user satisfaction, as well as the influence of these qualities on users’ behavioral intentions, including their intention to tell their friends about the system, to purchase the products recommended to them, and to return to the system in the future. We therefore adopted (part of) this framework to measure users’ subjective attitudes, based on their experience with the event recommender and the various algorithms tested during our experiment. Via an online questionnaire, test users were asked to answer 14 questions on a 5-point Likert scale from “strongly disagree” (1) to “strongly agree” (5) regarding aspects such as recommendation accuracy, novelty, diversity, satisfaction and trust of the system. We selected the following 8 most relevant questions for this research, regarding various aspects of the event recommendation system.
Q1 The items recommended to me matched my interests.

Q2 Some of the recommended items are familiar to me.

Q4 The recommender system helps me discover new products.

Q5 The items recommended to me are similar to each other (reverse scale).

Q7 I didn’t understand why the items were recommended to me (reverse scale).

Q8 Overall, I am satisfied with the recommender.

Q10 The recommender can be trusted.

Q13 I would attend some of the events recommended, given the opportunity.




3.   RESULTS
   We allowed all users of the event website to participate in our experiment and encouraged them to do so by means of e-mail and a banner on the site. In total, 612 users responded positively to our request. After a period of feedback logging, as described in Section 2.1, they were randomly distributed across the 5 recommendation algorithms, which calculated for each of them a list of 8 recommendations. After the recommendations were made available on the website, users were asked by mail to fill out the accompanying online questionnaire as described in Section 2.4.
   Of the 612 users who were interested in the experiment, 232 actually completed the online questionnaire regarding their recommendations. After removal of fake samples (i.e., users who answered every question with the same value) and users with incomplete (feedback) profiles, 193 users remained. They had on average 22 consumptions (i.e., expressed feedback values for events) and 84% of them had 5 or more consumptions. The final distribution of the users across the algorithms is displayed in Table 3.

       Algorithm     #Users
          CB            43
       CB+UBCF          36
        RAND            45
         SVD            36
        UBCF            33

Table 3: The 5 algorithms compared in this experiment and the number of users that actually completed the questionnaire about their recommendation lists.

   Figure 1 shows the averaged results of the answers provided by the 193 users in this experiment for the 8 questions we described in Section 2.4 and for each algorithm.

Figure 1: The averaged result of the answers (5-point Likert scale from “strongly disagree” (1) to “strongly agree” (5)) of the evaluation questionnaire for each algorithm and questions Q1, Q2, Q4, Q5, Q7, Q8, Q10 and Q13. The error bars indicate the 95% confidence interval. Note that questions Q5 and Q7 were in reverse scale.

   Evaluating the answers to the questionnaire showed that the hybrid recommender (CB+UBCF) achieved the best averaged results for all questions, except for question Q5, which asked the user to evaluate the similarity of the recommendations (i.e., diversity). For question Q5 the random recommender obtained the best results in terms of diversity, since random suggestions are rarely similar to each other. The CF algorithm was the runner-up in the evaluation and achieved second place after the hybrid recommender for almost all questions (again except for Q5, where CF was fourth after the random recommender, the hybrid recommender and SVD).


   The success of the hybrid recommender is not only clearly visible when comparing the average scores for each question (Figure 1); according to a Wilcoxon rank test (p < 0.05) it also performed statistically significantly better than every other algorithm (except for the CF recommender) for the majority of the questions (Q1, Q2, Q8, Q10 and Q13). Table 4 shows the algorithms and questions for which statistically significant differences could be noted according to this non-parametric statistical hypothesis test.
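Since the compared groups consist of different users, the test referred to here corresponds to the rank-sum (Mann-Whitney U) variant of the Wilcoxon test. A sketch with SciPy and made-up Likert answers:

```python
from scipy.stats import mannwhitneyu

# Sketch of the significance test between two algorithms for one question.
# The answers are made up; in the experiment each list would hold the 5-point
# Likert answers of the users assigned to that algorithm.
answers_hybrid = [4, 5, 4, 3, 5, 4, 4]
answers_random = [2, 3, 2, 4, 1, 3, 2]

# Two-sided Wilcoxon rank-sum (Mann-Whitney U) test at the 0.05 level.
statistic, p_value = mannwhitneyu(answers_hybrid, answers_random, alternative="two-sided")
print(statistic, p_value, p_value < 0.05)
```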
   The average performance of SVD was a bit disappointing: it achieved the worst results for questions Q1, Q7 and Q8, and the second worst results (after the random recommender) for questions Q2, Q4, Q10, Q11, and Q13. So, surprisingly, the SVD algorithm performs (on average) worse than the random method on some fundamental questions, like for example Q8, which addresses the general user satisfaction. We note, however, that the difference in values between SVD and the RAND algorithm was not found to be statistically significant except for question Q5.
   We looked more closely into this observation and plotted a histogram (Figure 2) of the different values (1 to 5) of the answers provided for question Q8. A clear distinction between the histogram of the SVD algorithm and the histograms of the other algorithms (CB and RAND shown in the figure) can be seen. Whereas for CB and RAND most values are grouped towards one side of the histogram (i.e., the higher values), this is not the case for SVD. It turns out that the opinions about the general satisfaction of the SVD algorithm were somewhat divided between good and bad, with no apparent winning answer. These noteworthy rating values for the SVD recommender are not only visible in the results of Q8, but also for other questions like Q2 and Q5. These findings indicate that SVD works well for many users, but also provides inaccurate recommendations for a considerable number of other users. These inaccurate recommendations may be due to a limited amount of user feedback and therefore sketchy user profiles.

Figure 2: The histogram of the values (1 to 5) that were given to question Q8 for algorithm CB (left), RAND (middle) and SVD (right).




    CB vs. CB+UBCF:   Q1, Q2, Q5, Q8, Q10, Q13
    CB vs. RAND:      Q2, Q5
    CB vs. SVD:       Q1, Q5, Q7, Q8
    CB vs. UBCF:      Q2, Q5, Q10
    CB+UBCF vs. RAND: Q1, Q2, Q4, Q5, Q7, Q8, Q10, Q13
    CB+UBCF vs. SVD:  Q1, Q2, Q7, Q8, Q10, Q13
    CB+UBCF vs. UBCF: Q13
    RAND vs. SVD:     Q5
    RAND vs. UBCF:    Q2, Q5, Q10
    SVD vs. UBCF:     Q1, Q2, Q7, Q8, Q10

Table 4: The complete matrix of statistically significant differences between the algorithms on all the questions, using the Wilcoxon rank test at a confidence level of 0.95. The matrix is symmetric, so each pair of algorithms is listed once.


   Figure 1 seems to indicate that some of the answers to the questions are highly correlated. One clear example is question Q1, about whether or not the recommended items matched the user’s interest, and question Q8, which asked about the general user satisfaction. As obvious as this correlation may be, other correlated questions may not be as easy to detect by inspecting a graph with averaged results, and so we calculated the complete correlation matrix for every question over all the algorithms using the two-tailed Pearson correlation metric (Table 5).
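A correlation matrix such as Table 5 can be computed in a few lines; the sketch below uses pandas and SciPy on made-up answer data.

```python
import pandas as pd
from scipy.stats import pearsonr

# Sketch of how a correlation matrix like Table 5 can be computed.
# `answers` holds one row per user and one column per question (made-up data).
answers = pd.DataFrame({
    "Q1": [4, 5, 2, 3, 4],
    "Q8": [4, 4, 2, 3, 5],
    "Q5": [3, 1, 4, 2, 3],
})

print(answers.corr(method="pearson"))          # pairwise Pearson correlations
r, p = pearsonr(answers["Q1"], answers["Q8"])  # two-tailed p-value for one pair
print(round(r, 3), round(p, 3))
```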
   From the correlation values, two similar trends can be noticed for questions Q8 and Q10, dealing with the user satisfaction and the trust of the system respectively. The answers to these questions are highly correlated (very significant, p < 0.01) with almost every other question except for Q5 (diversity). We must be careful not to confuse correlation with causality, but the data still indicates a strong relation between user satisfaction and recommendation accuracy and transparency.
   This strong relation may be another reason why SVD performed very badly in the experiment. Its inner workings are the most obscure and least obvious to the user, and therefore also the least transparent.
   Another interesting observation lies in the correlation values of question Q5. The answers to this diversity question are almost completely unrelated to every other question (i.e., low correlation values which are not significant, p > 0.05). It seems that the users of the experiment did not value the diversity of a recommendation list as much as the other aspects of the recommendation system. If we look at the average results (Figure 1) of the diversity question (lower is more diverse), we can see this idea confirmed. The ordering of how diverse the recommendation lists produced by the algorithms were is in no way reflected in the general user satisfaction or trust of the system.
   To gain some deeper insight into the influence of the qualitative attributes on each other, we performed a simple linear regression analysis. By trying to predict an attribute using all the other ones as input to the regression function, a hint of causality may be revealed. As regression method we used multiple stepwise regression, a combination of the forward and backward selection approach, which step by step tries to add new variables to its model (or remove existing ones) that have the highest marginal relative influence on the dependent variable. The following lines express the regression results; an arrow indicates which attributes were added to the model for each dependent variable. Between brackets we also indicate the coefficient of determination R², which expresses what percentage of the variance in the dependent variable can be explained by the model. R² is 1 for a perfect fit and 0 when no linear relationship could be found.

Q1  ←  Q7, Q8, Q10, Q13 (R² = 0.7131)
Q2  ←  Q7, Q10, Q13 (R² = 0.2195)
Q4  ←  Q10, Q13 (R² = 0.326)
Q5  ←  Q1, Q13 (R² = 0.02295)
Q7  ←  Q1, Q2, Q8, Q10 (R² = 0.6095)
Q8  ←  Q1, Q7, Q10, Q13 (R² = 0.747)
Q10 ←  Q1, Q2, Q4, Q7, Q8, Q13 (R² = 0.7625)
Q13 ←  Q1, Q2, Q4, Q5, Q8, Q10 (R² = 0.6395)

   The most interesting regression result is the line where Q8 (satisfaction) is predicted by Q1, Q7, Q10 and Q13. This result further strengthens our belief that accuracy (Q1) and transparency (Q7) are the main influencers of user satisfaction in our experiment (we consider Q10 and Q13 rather as results of satisfaction than as real influencers, but they are of course also connected).
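The regression procedure itself is standard; the sketch below shows a simplified forward-selection loop driven by the gain in R² (the paper combined forward and backward steps), applied to random placeholder data rather than the actual questionnaire answers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simplified forward-selection sketch. X holds the candidate questions as
# columns, y the question to predict; the data here is random placeholder data.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(100, 4)).astype(float)            # e.g. Q1, Q7, Q10, Q13
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.5, 100)    # e.g. Q8

def forward_selection(X, y, min_gain=0.01):
    selected, best_r2 = [], 0.0
    while True:
        gains = []
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            gains.append((r2, j))
        if not gains:
            break
        r2, j = max(gains)
        if r2 - best_r2 < min_gain:   # stop when the marginal gain is negligible
            break
        selected.append(j)
        best_r2 = r2
    return selected, best_r2

print(forward_selection(X, y))
```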
   Table 6 shows the coverage of the algorithms in terms of the number of users they were able to produce recommendations for. In our experiment we noticed an average coverage of 66%, excluding the random recommender.

       Algorithm     Coverage (%)
           CB             69%
       CB+UBCF            66%
        RAND             100%
         SVD              66%
        UBCF              65%

Table 6: The 5 algorithms compared in this experiment and their coverage in terms of the number of users for which they were able to generate a recommendation list of minimum 8 items.




              Q1           Q2            Q4           Q5            Q7              Q8           Q10          Q13
         (accuracy)  (familiarity)   (novelty)   (diversity)  (transparency)  (satisfaction)   (trust)   (usefulness)
   Q1         1          .431          .459         .012         -.731            .767          .783         .718
   Q2        .431          1            .227         .036         -.405            .387          .429         .415
   Q4        .459        .227            1          -.037         -.424            .496          .516         .542
   Q5        .012        .036          -.037          1            .016           -.008          .001        -.096
   Q7       -.731       -.405          -.424         .016           1             -.722         -.707        -.622
   Q8        .767        .387           .496        -.008         -.722             1            .829         .712
   Q10       .783        .429           .516         .001         -.707            .829           1           .725
   Q13       .718        .415           .542        -.096         -.622            .712          .725           1

Table 5: The complete correlation matrix for the answers to the 8 most relevant questions of the online questionnaire. The applied metric is the Pearson correlation, so values range between -1.0 (negatively correlated) and 1.0 (positively correlated). Note that the matrix is symmetric and that questions Q5 and Q7 were in reverse scale.


   Next to this online and user-centric experiment, we also ran some offline tests and compared them to the real opinions of the users. We calculated the recommendations on a training set that randomly contained 80% of the collected feedback in the experiment. Using the leftover 20% as the test set, the accuracy of every algorithm was calculated over all users in terms of precision, recall and F1-measure (Table 7). This procedure was repeated 10 times to average out any random effects.

      Algorithm     Precision (%)   Recall (%)    F1 (%)
          CB            0.462         2.109        0.758
      CB+UBCF           1.173         4.377        1.850
       RAND             0.003         0.015        0.005
         SVD            0.573         2.272        0.915
        UBCF            1.359         4.817        2.119

Table 7: The accuracy of the recommendation algorithms in terms of precision, recall and F1-measure based on an offline analysis.
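The offline protocol can be summarized as follows; the sketch uses a toy feedback list and a stand-in recommender, and the per-user averaging details are assumptions rather than the authors' exact procedure.

```python
import random

# Sketch of the offline protocol: hold out 20% of the collected feedback as a
# test set, recommend on the remaining 80%, and average precision/recall/F1
# over the test users; repeated over several random splits.
def precision_recall_f1(recommended, relevant):
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def offline_evaluation(feedback, recommend, runs=10, test_fraction=0.2, n=8):
    """feedback: list of (user, event) pairs; `recommend` stands in for any algorithm."""
    scores = []
    for _ in range(runs):
        shuffled = feedback[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_fraction))
        train, test = shuffled[:cut], shuffled[cut:]
        relevant_per_user = {}
        for user, event in test:
            relevant_per_user.setdefault(user, set()).add(event)
        for user, relevant in relevant_per_user.items():
            scores.append(precision_recall_f1(recommend(user, train, n), relevant))
    return [sum(values) / len(scores) for values in zip(*scores)]

# Toy usage with a trivial popularity recommender.
toy = [("u1", "e1"), ("u1", "e2"), ("u2", "e1"), ("u2", "e3"), ("u3", "e1")]
def most_popular(user, train, n):
    counts = {}
    for _, event in train:
        counts[event] = counts.get(event, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:n]

print(offline_evaluation(toy, most_popular, runs=2, n=2))
```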
   By comparing the offline and online results in our experiment, we noticed a small change in the ranking of the algorithms. In terms of precision the UBCF approach came out best, followed by CB+UBCF, SVD, CB and RAND respectively. While the hybrid approach performed best in the online analysis, this is not the case for the offline tests. Note that SVD and CB have also swapped places in the ranking: SVD was slightly better at predicting user behaviour than the CB algorithm. A possible explanation (for the inverse online results) is that users in the online test may have valued the transparency of the CB algorithm over its (objective) accuracy. Our offline evaluation further underlines the shortcomings of these procedures. In our experiment we had over 30,000 items that were available for recommendation and on average only 22 consumptions per user. The extremely low precision and recall values are the result of this extreme sparsity problem.
   It would have been interesting to be able to correlate the accuracy values obtained by offline analysis with the subjective accuracy values provided by the users. Experiments however showed very fluctuating results, with on the one hand users with close to zero precision and on the other hand some users with relatively high precision values. These results could therefore not be properly matched against the results gathered online.
4.   DISCUSSION
   The results clearly indicate the hybrid recommendation algorithm (CB+UBCF) as the overall best algorithm for optimizing the user satisfaction in our event recommendation system. The runner-up for this position would definitely be the UBCF algorithm, followed by the CB algorithm. This comes as no surprise considering that the hybrid algorithm is merely a combination of these UBCF and CB algorithms. Since the UBCF algorithm is second best, it looks like this algorithm is the most responsible for the success of the hybrid. While the weights of both algorithms were equal in this experiment (i.e., the 4 best recommendations of each list were selected to be combined in the hybrid list), it would be interesting to see how the results evolve if these weights were tuned more in favour of the CF approach (e.g., 5 × UBCF + 3 × CB).
   Because we collected both implicit and explicit feedback to serve as input for the recommendation algorithms, there were no restrictions as to which algorithms we were able to use. Implicit feedback that was logged before an event took place allowed the use of CF algorithms, and the availability of item metadata enabled content-based approaches. Only in this ideal situation can a hybrid CB+UBCF algorithm serve an event recommendation system.
   The slightly changed coverage is another issue that may come up when a hybrid algorithm like this is deployed. While the separate CB and UBCF algorithms had coverages of 69% and 65% respectively, the hybrid combination served 66% of the users. We can explain this increase of 1% with respect to UBCF by noting that the hybrid algorithm requires a minimum of only 4 recommendations (versus 8 normally) to be able to provide the users with a recommendation list.

5.   CONCLUSIONS
   For a Belgian cultural events website we wanted to find a recommendation algorithm that improves the user experience in terms of user satisfaction and trust. Since offline evaluation metrics are inadequate for this task, we have set up an online, user-centric evaluation experiment with 5 popular and common recommendation algorithms, i.e., CB, CB+UBCF, RAND, SVD and UBCF. We logged both implicit and explicit feedback data in the form of weighted user interactions with the event website over a period of 41 days. We extracted the users for which every algorithm was able to generate at least 8 recommendations and presented each of these users with a recommendation list randomly chosen from one of the 5 recommendation algorithms. Users were asked to fill out an online questionnaire that addressed qualitative aspects of their recommendation lists including accuracy, novelty, diversity, satisfaction, and trust.




   Results clearly showed that the CB+UBCF algorithm, which combines the recommendations of both CB and UBCF, outperforms every other algorithm (or is equally good, in the case of question Q2 and the UBCF algorithm) except for the diversity aspect. In terms of diversity the random recommendations turned out best, which of course makes perfect sense. Inspection of the correlation values between the answers to the questions revealed, however, that diversity is in no way correlated with user satisfaction, trust or, for that matter, any other qualitative aspect we investigated. The recommendation accuracy and transparency, on the other hand, were the two qualitative aspects most highly correlated with user satisfaction and proved to be promising predictors in the regression analysis.
   The SVD algorithm came out last in the ranking of the algorithms and was statistically even indistinguishable from the random recommender for most of the questions, except for, again, the diversity question (Q5). A histogram of the values for SVD and question Q8 puts this into context by revealing an almost black-and-white opinion pattern expressed by the users in the experiment.
6.   FUTURE WORK
   While we were able to investigate numerous different qualitative aspects of each algorithm individually, the experiment did not allow us, apart from indicating a best and a worst algorithm, to construct an overall ranking of the recommendation algorithms, since each user ended up evaluating just one algorithm. As future work, we intend to extend this experiment with a focus group, allowing us to elaborate on the reasoning behind some of the answers users provided and to compare subjective rankings of the algorithms.
   We also plan to extend our regression analysis to come up with a causal path model that will allow us to better understand how the different algorithms influence the overall satisfaction.
7.   ACKNOWLEDGMENTS
   The research activities described in this paper were funded by a PhD grant to Simon Dooms from the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT Vlaanderen) and a PhD grant to Toon De Pessemier from the Fund for Scientific Research-Flanders (FWO Vlaanderen). We would like to thank CultuurNet Vlaanderen for the effort and support they were willing to provide for deploying the experiment described in this paper.
8.   REFERENCES
[1] D. Bollen, B. Knijnenburg, M. Willemsen, and M. Graus. Understanding choice overload in recommender systems. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 63–70. ACM, 2010.
[2] E. Campochiaro, R. Casatta, P. Cremonesi, and R. Turrin. Do metrics make recommender algorithms? In Proceedings of the 2009 International Conference on Advanced Information Networking and Applications Workshops, WAINA '09, pages 648–653, Washington, DC, USA, 2009. IEEE Computer Society.
[3] C. Cornelis, X. Guo, J. Lu, and G. Zhang. A fuzzy relational approach to event recommendation. In Proceedings of the Indian International Conference on Artificial Intelligence, 2005.
[4] S. Dooms, T. De Pessemier, and L. Martens. An online evaluation of explicit feedback mechanisms for recommender systems. In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST), 2011.
[5] X. Guo, G. Zhang, E. Chew, and S. Burdon. A hybrid recommendation approach for one-and-only items. AI 2005: Advances in Artificial Intelligence, pages 457–466, 2005.
[6] C. Hayes, P. Massa, P. Avesani, and P. Cunningham. An on-line evaluation framework for recommender systems. In Workshop on Personalization and Recommendation in E-Commerce. Citeseer, 2002.
[7] R. Klamma, P. Cuong, and Y. Cao. You never walk alone: Recommending academic events based on social network analysis. Complex Sciences, pages 657–670, 2009.
[8] D. Lee. Pittcult: trust-based cultural event recommender. In Proceedings of the 2008 ACM Conference on Recommender Systems, pages 311–314. ACM, 2008.
[9] S. McNee, J. Riedl, and J. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI '06 Extended Abstracts on Human Factors in Computing Systems, page 1101. ACM, 2006.
[10] P. Pu and L. Chen. A user-centric evaluation framework of recommender systems. In Proceedings of the ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces (UCERSTI), 2010.
[11] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender system - a case study. Citeseer, 2000.