Proceedings of the ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces (UCERSTI), Barcelona, Spain, Sep 30, 2010. Published by CEUR-WS.org, ISSN 1613-0073, online: ceur-ws.org/Vol-612/paper3.pdf. Copyright © 2010 for the individual papers by the papers' authors; copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors: Knijnenburg, B.P., Schmidt-Thieme, L., Bollen, D.

A User-Centric Evaluation Framework of Recommender Systems

Pearl Pu
Human Computer Interaction Group
Swiss Federal Institute of Technology (EPFL)
CH-1015 Lausanne, Switzerland
Tel: +41-21-6936081
pearl.pu@epfl.ch

Li Chen
Department of Computer Science
Hong Kong Baptist University
224 Waterloo Road, Hong Kong
Tel: +852-34117090
lichen@comp.hkbu.edu.hk

ABSTRACT
User experience research is increasingly attracting researchers' attention in the recommender system community. Existing works in this area have suggested a set of criteria detailing the characteristics that constitute an effective and satisfying recommender system from the user's point of view. To combine these criteria into a more comprehensive framework which can be used to evaluate the perceived qualities of recommender systems, we have developed a model called ResQue (Recommender systems' Quality of user experience). ResQue consists of 13 constructs and a total of 60 question items. It aims to assess the perceived qualities of recommenders, such as their usability, usefulness, interface and interaction qualities, users' satisfaction with the systems, and the influence of these qualities on users' behavioral intentions, including their intention to purchase the products recommended to them, return to the system in the future, and tell their friends about the system. This model thus identifies the essential qualities of an effective and satisfying recommender system and the essential determinants that motivate users to adopt the technology. The related questionnaire can be further adapted for a custom-made user evaluation or combined with objective performance measures. We also propose a simplified version of the model with 15 questions which can be employed as a usability questionnaire for recommender systems.

Categories and Subject Descriptors
H.1.2 [User/Machine Systems]: Human factors; H.5.2 [User Interfaces]: evaluation/methodology, user-centered design.

General Terms
Measurement, Experimentation, Human Factors.

Keywords
Quality measurement, usability evaluation, recommender systems, quality of user experience, e-commerce recommender, post-study questionnaire, evaluation of decision support.

1. INTRODUCTION
A recommender system is a web technology that proactively suggests items of interest to users based on their objective behavior or their explicitly stated preferences. It is no longer a fanciful website add-on, but a necessary component. According to the 2007 ChoiceStream survey (2007 ChoiceStream Personalization Survey, ChoiceStream, Inc.), 45% of users are more likely to shop at a website that employs recommender technology. Furthermore, a higher percentage (69%) of users in the highest spending category are more likely to desire the support of recommendation technology.

Characterizing and evaluating the quality of user experience and users' subjective attitudes toward the acceptance of recommender technology is an important issue which merits attention from researchers and practitioners in both the web technology and human factors fields. This is because recommender technology is becoming widely accepted as an important component that provides benefits to users and enhances the website's revenue. For users, the benefits include more efficiency in finding preferential items, more confidence in making a purchase decision, and a potential chance to discover something new. For the marketer, this technology can significantly increase users' likelihood to buy the items recommended to them and enhance their overall satisfaction and loyalty, raising the chances that they return to the site and recommend it to their friends. Thus, evaluating users' perception of a recommender system can help developers and marketers understand more precisely whether users actually experience and appreciate the intended benefits. This will, in turn, help improve the various aspects of the system and more accurately predict the adoption of a particular recommender.

So far, previous research work on recommender system evaluation has mainly focused on algorithm accuracy [9,1], especially objective prediction accuracy [25,26]. More recently, researchers began examining issues related to users' subjective opinions [30,13] and developing additional criteria to evaluate recommender systems [18,33]. In particular, they suggest that user satisfaction does not always correlate with high recommender accuracy. Increasingly, researchers are investigating user experience issues such as identifying determinants that influence users' perception of recommender systems [30], effective preference elicitation methods [19], techniques that motivate users to rate items that they have experienced [2], methods that generate diverse and more satisfying recommendation lists [33], explanation interfaces [31], trust formation with recommenders [6], and design guidelines for enhancing a recommender's interface layout [22]. However, the field lacks a general definition and evaluation framework of what constitutes an effective and satisfying recommender system from the user's perspective.

Our present work aims to review existing usability-oriented evaluation research in the field of recommender systems to identify essential determinants that motivate users to adopt this technology. We then apply well-known usability evaluation models, including TAM [7] and SUMI [15], in order to develop a more balanced framework.
The final model, which we call ResQue, consists of 13 constructs and a total of 60 question items categorized into four main dimensions: the perceived system qualities, users' beliefs as a result of these qualities, their subjective attitudes, and their behavioral intentions. The structure and criteria of our framework are derived on the basis of three essential characteristics of recommender systems: 1) being an interaction-driven application and a critical part of online e-commerce services, 2) providing information filtering technology and suggesting recommended items, and 3) providing decision support technology for the users.

The main contribution of this paper is the development of a well-balanced evaluation framework for measuring the perceived qualities of a recommender and predicting users' behavioral intentions as a result of these qualities. It is thus a forecasting model that helps us understand users' motivation in adopting a certain recommender. Secondly, the framework aims to help designers and researchers easily perform a usability and user acceptance test during any stage of the design and deployment phase of a recommender. These usability tests can be performed either on a stand-alone basis or as a post-study questionnaire. The model can be further combined with measurements that address other perceived qualities of a recommender, such as security and robustness issues. For those who are interested in a quick usability evaluation, we also propose a simplified version of the model with 15 questions.

2. EVALUATION WORK FROM USERS' POINT OF VIEW
Swearingen and Sinha [30] conducted a user study on eleven recommender systems in order to understand and discover influential factors, other than algorithm accuracy, that affect users' perception. The main results are that transparent system logic, recommendation of familiar items, and sufficient supporting information for recommended items are crucial in influencing users' favorable perception of a system. They also highlighted that trust and willingness to purchase should be noted. In addition, the users' appreciation of online recommendations was compared with that of recommendations from their friends, defining the notion of relative accuracy.

McNee et al. [20] pointed out that accuracy metrics alone and the commonly employed leave-one-out procedure are very limited in evaluating recommender systems. User satisfaction does not always correlate with high recommender accuracy. Metrics are needed to determine good and useful recommendations, such as the serendipity, salience, and diversity of the recommended items.

Tintarev and Masthoff provided a comprehensive survey of the explanation functionality used in ten academic and eight commercial recommenders [31]. They derived seven main aims of the explanation facility which can help a recommender significantly enhance users' satisfaction: transparency (explains why recommendations were generated), scrutability (the ability for the user to critique the system), trust (increase users' confidence in the system), effectiveness (help users make good decisions), persuasiveness (convince users to try or buy items recommended to them), efficiency (help users make decisions faster) and satisfaction (increase the ease of use and enjoyment). These aims are very similar to the set of criteria that we have developed in ResQue, except that we focus more on the system as a whole rather than just the explanation component.

Ozok et al. [22] explored recommender systems' usability and user preferences from both the structural (how recommender systems should look) and content (what information recommender systems should contain) perspectives. A two-layer interface usability evaluation model including both micro- and macro-level interface evaluations was proposed, followed by a Survey on Usability of E-Commerce Recommender Systems (SUERS). The survey was administered to 131 college-aged online shoppers to measure and rank the importance of structural and content aspects of recommender systems from the shoppers' perspectives. The main result was a set of 14 design guidelines. The micro level of the guidelines provided suggestions specific to the recommended product, such as which attributes (name, price, image, description, rating, etc.) to include in the interface. The macro level provided suggestions concerning when, where and how the recommended products should be displayed. The development process of the model was limited, as the authors did not go through an iterative process of evaluating and refining the model. Instead, it was purely based on a literature survey of the quite limited past work on subjective evaluations of recommender systems. Most importantly, it failed to explain how usability issues influence users' behavioral intentions, such as their intention to buy the items recommended to them, whether they will continue using the system, and whether they will recommend the system to their friends.

Jones and Pu [13] presented the first significant user study that aimed to understand users' initial adoption of recommender technology and their subjective perceptions of the system. Study results show that a simple interface design, a small amount of initial effort required by the system to get to know the users, and perceived qualities such as the subjective accuracy, novelty and enjoyability of the recommended items are the key design factors that significantly enhance a website's ability to attract users.

3. MODEL DEVELOPMENT
A measurement model consists of a set of constructs, the participating questions for each construct, the scale's dimensions, and a procedure for conducting the questionnaire. Psychometric questionnaires such as the one proposed in this paper require the validation of the questions used, data gathering, and statistical analysis before they can be used with confidence. The current model and its constructs were based on our past work investigating various interface and interaction issues between users and recommenders. In over 10 user studies, we have carefully and progressively developed and employed user satisfaction questionnaires to evaluate recommenders' perceived qualities such as ease of use, perceived usefulness and users' satisfaction and behavioral intentions [4,5,6,12,13,14,23,24]. This past research has given us a unique opportunity to synthesize and organize the accumulation of existing questionnaires and develop a well-balanced framework.

In the model development process, we also compare our constructs with those used in TAM and SUMI, two well-known and widely adopted measurement frameworks.

TAM (Technology Acceptance Model) seeks to understand a set of perceived qualities of a system and users' intention to adopt the system as a result of these qualities, thus explaining not only the desirable outcome of a system but also users' motivation.
The original TAM listed three constructs: the perceived ease of use of a system, its perceived usefulness and users' intention to use the system. However, TAM was also criticized for its over-simplicity and generality. Venkatesh et al. [32] formulated an updated version of TAM, called the Unified Theory of Acceptance and Use of Technology. In this more recent theory, four key constructs (performance expectancy, effort expectancy, social influence, and facilitating conditions) were presented as direct determinants of usage intentions and behaviors.

SUMI (Software Usability Measurement Inventory) is a psychometric evaluation model developed by Kirakowski and Corbett [15] to measure the quality of software from the end-user's point of view. The model consists of 5 constructs (efficiency, affect, helpfulness, control, learnability) and 50 questions. It is widely used to help designers and developers assess the quality of use of a software product or prototype and can assist with the detection of usability flaws and the comparison between software products.

By adapting our past work to the TAM and SUMI models, we have identified 4 essential constructs of ResQue for a successful recommender system to fulfill from the users' point of view: 1) user perceived qualities of the system, 2) user beliefs as a result of these qualities in terms of ease of use, usefulness and control, 3) their subjective attitudes, and 4) their behavioral intentions. Figure 1 depicts the detailed schema of the constructs of ResQue and some of the scales for each construct.

[Figure 1: Constructs of an Evaluation Framework on the Perceived Qualities of Recommenders (ResQue).]
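Since the figure itself could not be reproduced here, the following sketch restates its structure in machine-readable form. The grouping follows the four dimensions and the scales named in Sections 3.1-3.4; the identifier names are our own, not part of the model.

```python
# A machine-readable restatement of the ResQue schema sketched in Figure 1.
# The grouping follows the four dimensions described in Sections 3.1-3.4;
# the identifier names themselves are ours, not part of the model.
RESQUE_SCHEMA = {
    "perceived_system_qualities": [
        "quality_of_recommended_items",  # accuracy, familiarity, novelty, ...
        "interaction_adequacy",
        "interface_adequacy",
    ],
    "beliefs": [
        "perceived_ease_of_use",
        "perceived_usefulness",
        "control_transparency",
    ],
    "attitudes": [
        "overall_satisfaction",
        "confidence_and_trust",
    ],
    "behavioral_intentions": [
        "intention_to_use_the_system",
        "purchase_intention",
        "continuance_and_frequency",
        "recommendation_to_friends",
    ],
}
```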
When administering the questionnaires, we assume that the recommender system being evaluated is part of an online system. To make the evaluation more focused on the recommender component, we often give subjects a specific task, "find an ideal product to buy/experience from an online site", where the recommender in question is a constituent component.

In the following sections, the meaning of each scale as well as its subscales is defined and explained; the sample questions that can be used in a questionnaire are given in the appendix at the end of the paper. It is a common practice in questionnaire development to vary the tone of items to control potential response biases: typically, some of the items elicit agreement and others elicit disagreement. For some of the items, therefore, we also suggest reverse scale questions. A 5-point Likert scale from "strongly disagree" (1) to "strongly agree" (5) is recommended to characterize users' responses.
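To make the scoring convention concrete, the following minimal sketch (ours, not part of the ResQue specification) recodes reverse scale answers as 6 minus the raw value, the usual rule for a 1-5 Likert scale, and averages a construct's items into a single score.

```python
def recode(raw: int, reverse: bool) -> int:
    """Recode one 1-5 Likert answer; reverse scale items become 6 - raw."""
    return 6 - raw if reverse else raw

def construct_score(answers: list[tuple[int, bool]]) -> float:
    """Average the recoded (raw_value, is_reverse_scale) answers of one construct."""
    recoded = [recode(raw, rev) for raw, rev in answers]
    return sum(recoded) / len(recoded)

# Example: three perceived accuracy items, the third one reverse scale
# ("I am not interested in the items recommended to me", answered with 2).
print(construct_score([(4, False), (5, False), (2, True)]))  # -> 4.33...
```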
3.1 Perceived System Qualities
This construct refers to the functional and informational aspects of a recommender and how the perceived qualities of these aspects influence users' beliefs about the ease of use, usefulness and control/transparency of a system. A recommender system is not simply part of a website, but more importantly a decision support tool. We focus on three essential dimensions: the quality of the recommended items, the interaction adequacy, and the interface adequacy as the recommender helps users reach a purchase decision.

3.1.1 Quality of Recommended Items
The items proposed by a recommender can be considered one of the main features of the system. Quality here refers to the information quality and genuine usefulness of the suggested items. Presented as a collection, the recommended items are often labeled and placed in a certain area of the recommender page. Some systems also propose grouping them into meaningful subareas to increase users' comprehension of the list and enable them to more effectively reach decisions [4]. In our earlier work, we have found strong correlations between the following qualities of the recommended items and users' intention to use the system.

Perceived accuracy is the degree to which users feel the recommendations match their interests and preferences. It is an overall assessment of how well the recommender has understood the users' preferences and tastes. This subjective measure is significantly easier to obtain than the measure of objective accuracy that we used in our earlier work [23]. Our studies show that they are strongly correlated [6]. In other words, if users respond well to this question, it is likely that the underlying algorithm is accurate in predicting users' interests. In addition, it is useful to use relative accuracy to compare the difference between the recommendations a user may get from a system and those from friends [28]. It can serve as a useful complement to perceived accuracy because it implicitly sets up friends' recommendation quality as a baseline.

Familiarity describes whether or not users have previous knowledge of, or experience with, the items recommended to them. Swearingen and Sinha [30] indicated that users like and prefer to get recommendations of previously experienced items because their presence reinforces trust in the recommender system. However, users can be frustrated by too much familiarity. Therefore, it is important to know whether or not a recommender has achieved the proper balance of familiarity and novelty from the users' perspective.

Novelty (or discovery) is the extent to which users receive new and interesting recommendations. The core concept of novelty is related to the recommender's ability to educate users and help them discover new items [24]. In [20], a similar concept, called "serendipity", was suggested. Herlocker [11] argued that novelty is different from serendipity, because novelty only covers the concept of "new" while serendipity means not only "new" but also "surprising". However, in an actual user evaluation procedure, this meticulous distinction between the two words will cause confusion for users. Therefore, we suggest novelty and discovery as two similar questions. More user trials will be needed to further delineate the serendipity question.

The attractiveness of the recommended items refers to whether or not recommended items are capable of stimulating users' imagination and evoking a positive emotion of interest or desire. Attractiveness is different from accuracy and novelty. An item can be accurate and novel, but not necessarily attractive; a novel item is different from anything a user has ever experienced, whereas an attractive item stimulates the user in a positive manner. This concept is similar to the salience factor in [20]. While judging novelty requires a user to think about the distinguishing factors of an item, attractiveness brings to mind the outstanding quality of an item and has a more emotional tone to it.

The enjoyability of recommended items refers to whether users have enjoyed experiencing the items suggested to them. It was found to have a significant correlation to users' intention to use and return to the system [13]. This is the only scale that assesses a user's actual experience of a recommender. In many online study scenarios, it is not possible to immediately measure enjoyability unless users are asked to answer a questionnaire after a few weeks, when they have actually received and experienced the item. In testing music or film recommenders, it is possible to allow users to answer this question if they are given the opportunity to listen to a song excerpt or watch a movie trailer.

Diversity measures the diversity level of items in the recommendation list. As the recommendation list is the first piece of information users will encounter before they examine the details of an individual recommendation, users' impression of this list is important for their perception of the whole system. At this stage, it has been found that a low diversity level might disappoint users and could cause them to leave the recommender [13]. McGinty and Smyth [17] proposed integrating diversity with similarity in order to adaptively select the appropriate strategy (recommending either similar or diverse items) given each individual user's past behavior and current needs. The literature also suggests that a recommendation list as a complete entity should be judged for its diversity, rather than treating each recommendation as an isolated item [33].
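As an objective counterpart to this subjective scale, list-level diversity is often quantified as the average pairwise dissimilarity within one recommendation list, in the spirit of the topic diversification work in [33]. The sketch below is ours; the feature-set representation and the Jaccard distance are illustrative assumptions, not part of ResQue.

```python
from itertools import combinations

def intra_list_diversity(items: list[set[str]]) -> float:
    """Average pairwise Jaccard distance over all pairs in one recommendation list.

    Each item is represented as a set of content features (e.g. genres).
    The list as a whole, not any single item, is the unit being judged.
    """
    def jaccard_distance(a: set[str], b: set[str]) -> float:
        return 1.0 - len(a & b) / len(a | b)

    pairs = list(combinations(items, 2))
    if not pairs:  # a list of 0 or 1 items carries no diversity signal
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

# Example: a three-movie list described by genre sets.
movies = [{"sci-fi", "action"}, {"romance", "drama"}, {"action", "drama"}]
print(round(intra_list_diversity(movies), 2))  # -> 0.78
```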
Context compatibility evaluates whether or not the recommendations take general or personal context requirements into account. For example, for a movie recommender, the necessary context information may include a user's current mood, different occasions for watching the movie, whether or not other people will be present, and whether the recommendation is timely. A good recommender system should be able to formulate recommendations considering the different kinds of contextual factors that are likely to take effect.

3.1.2 Interaction Adequacy
Besides issues related to the quality of recommended items, the system's ability to present recommendations, to allow for user feedback, and to explain why recommendations are made in support of purchasing decisions also weighs heavily on users' overall perception of a recommender. Thus, three main interaction mechanisms are usually suggested in various recommenders: initial preference elicitation, preference revision, and the system's ability to explain its results. Behavior-based recommenders do not require users to explicitly indicate their preferences, but collect such information via users' browsing and purchasing history. For rating- and preference-based recommenders, this process requires a user to rate a set of items or state their preferences on desired items in a graphical user interface [23]. Some conversational recommenders provide explicit mechanisms for users to provide feedback in the form of critiques [6]. The simplest critiques indicate whether the recommended item is good or bad, while more sophisticated ones show users a set of alternative items that take into account users' desire for these items and the potential superior values they offer, such as a better price or more popularity [6].

The final interaction quality being measured is the system's ability to explain the recommended results. Herlocker et al. [10], Sinha and Swearingen [30] and Tintarev and Masthoff [31] demonstrated that a good explanation interface could help inspire users' trust and satisfaction by giving them information to personally justify recommendations, increasing user involvement and educating users on the internal logic of the system [10, 31]. In addition, Tintarev and Masthoff [31] defined in detail the possible aims of explanation facilities: transparency, scrutability, trust, effectiveness, persuasiveness, efficiency, and satisfaction. Pu and Chen extensively investigated design guidelines for developing explanation-based recommender interfaces [4]. They found that organization interfaces are particularly effective in promoting users' satisfaction with the system, convincing them to buy the items recommended to them, and bringing them back to the store in the future.

3.1.3 Interface Adequacy
Interface design issues related to recommenders have also been extensively investigated in [10, 20, 31, 22]. Most of the existing work is concerned with how to optimize the recommender page layout to achieve the maximum visibility of the recommendations, e.g. whether to use image, text, or a combination of the two. A detailed set of design guidelines was investigated and proposed in [22]. In our current model, we mainly emphasize users' subjective evaluations of a recommender interface in terms of its information sufficiency and the adequacy and clarity of its labels and layout.

3.2 Beliefs
3.2.1 Perceived Ease of Use
Perceived ease of use, also known as efficiency in SUMI and perceived cognitive effort in our existing work [6,14], measures users' ability to quickly and correctly accomplish tasks with ease and without frustration. We also use it to refer to decision efficiency, i.e. the extent to which a recommender system helps users find their preferred items quickly. Although task completion and learning time can be measured objectively, it can be difficult to distinguish the actual task completion time from the measured task time for various reasons. Users may be exploring the website and discovering information unrelated to the assigned task. This is especially true if a system is entertaining and educational, and its interface and content are very appealing. It is also possible that the user perceives that he/she has spent less time while the measured task completion time is in fact high. Therefore, evaluating perceived ease of use may be more appropriate than using the objective task completion time to measure a system's ease of use.

Besides the overall perceived ease of use, perceived initial effort should also be taken into account, given the new user problem. Perceived initial effort is the effort users perceive they contribute to the system before they get the first set of recommendations. The initial effort could be spent on rating items [19], specifying preferences, or answering personality quizzes [12]. Theoretically speaking, recommender systems should try to minimize the effort users expend for a good recommendation [30].
Easy to learn, known as "learnability" in SUMI, initially appears to be an inadequate dimension, since most recommenders require a minimal amount of learning by design. However, since some users may not initially notice the recommended items or know exactly what they are intended for, especially without clear labels or explicit explanations on the interface, the learning aspect should be included to measure how easily users discover the recommended items. In addition, some recommenders, such as critiquing-based recommenders, do allow users to provide feedback to increase the personalization of the recommender. In this case, the learning construct measures how easy it is for users to alter their personal profile information in order to receive different recommendations.

3.2.2 Perceived Usefulness
Perceived usefulness of a recommender (called perceived competence in our previous work) is the extent to which a user finds that using the recommender system would improve his/her performance, compared with previous experiences without the help of a recommender [4]. This element requests users' opinion as to whether or not the system is useful to them. Since recommenders used in e-commerce environments mainly assist users in finding relevant information to support their purchase decision, we further qualify usefulness in two aspects: decision support and decision quality.

Recommender technology provides decision support to users in the process of selecting preferential items, for example when making a purchase in an e-commerce environment. The objective of decision technologies in general is to overcome the limits of users' bounded rationality and to help them make more satisfying decisions with a minimal amount of effort [29]. Recommender systems specifically help users manage an overwhelming flood of information and make high-quality decisions under limited time and knowledge constraints. Decision support thus measures the extent to which users feel assisted by the recommender system.

In addition to the efficiency of decision making, the quality of the decision (decision quality) also matters. The quality of a system-facilitated decision can be assessed by the confidence criterion, which is the level of a user's certainty in believing that he/she has made a correct choice with the assistance of a recommender.

3.2.3 Control and Transparency
User control measures whether users feel in control of their interaction with the recommender. The concept of user control includes the system's ability to allow users to revise their preferences, to customize received recommendations, and to request a new set of recommendations. This aspect weighs heavily in the overall user experience of the system. If the system does not provide a mechanism for a user to reject recommendations that he/she dislikes, the user will be unable to stop the system from continuously recommending items, which might cause him/her to be disappointed with the system.

Transparency determines whether or not a system allows users to understand its inner logic, i.e. why a particular item is recommended to them. A recommender system can convey its inner logic to the user via an explanation interface [4,10,30,31]. To date, many researchers have emphasized that transparency has a certain impact on other critical aspects of users' perception. Swearingen and Sinha [30] showed that the more transparent a recommended product is, the more likely users would be to purchase it. In addition, Simonson [27] suggested that the perceived accuracy of a recommendation depends on whether or not the user sees a correspondence between the preferences expressed in the measurement process and the recommendation presented by the system.
3.3 Attitudes
Attitude is a user's overall feeling towards a recommender, which is most likely derived from his/her experience while interacting with the recommender. An attitude is generally believed to be more long-lasting than a belief. Users' attitudes towards a recommender are highly influential on their subsequent behavioral intentions. Many researchers regard positive attitudes, including users' satisfaction with and trust of a recommender, as important factors. Evaluating overall satisfaction determines what users think and feel while using a recommender system. It gives users an opportunity to express their preferences and opinions about a system in a direct way. Confidence inspiring refers to the recommender's ability to inspire confidence in users, or its ability to convince users of the information or products recommended to them. Trust indicates whether or not users find the whole system trustworthy. Studies show that consumer trust is positively associated with consumers' intentions to transact, purchase a product, and return to the website [8]. The trust level is determined by the reputation of the online system [8], as well as the recommender system's ability to formulate good recommendations and provide useful explanation interfaces [4,10,19]. However, as trust is a long-term relationship between a user and an online system, it is sometimes difficult to measure trust after only a short period of interaction with a system. Thus, we recommend observing trust formation over time, as users are incrementally exposed to the same recommender.

3.4 Behavioral Intentions
Behavioral intentions towards a system relate to whether or not the system is able to influence users' decision to use the system and to purchase some of the recommended results.

One of the fundamental goals for an e-commerce website is to maximize user loyalty and lifetime value in order to stimulate users' future visits and purchases. User loyalty evaluates the system's ability to convince users to reuse the system, or to persuade them to introduce the system to their friends in order to increase the number of clients. Accordingly, this dimension consists of the following criteria: user agreement to use the system, user acceptance of the recommended items (resulting in a purchase), user retention, and intention to introduce the system to his/her friends. By using a questionnaire, the user's intention to return can be measured as a satisfactory approximation of actual user retention, because the Theory of Planned Behavior [32] states that behavioral intention can be a strong predictor of actual behavior. Although the website's integrity, reputation and price quality will also likely impact user loyalty, the most important factor for a recommender system is to help users effectively find a satisfying product, i.e. the quality of its recommendations [7].
4. SIMPLIFIED MODEL
In the previous sections, we described the development process of a subjective evaluation framework to measure users' perceived qualities of a recommender as well as users' behavioral intentions, such as their intention to buy or use the items suggested to them, continue to use the system, and tell their friends about the recommender. We described both the constructs and corresponding sample questions (see Appendix A for a summary).

Our overall motivation for this research was to understand the crucial factors that influence user adoption of recommenders. Another motivation was to come up with a subjective evaluation questionnaire that other researchers and practitioners can employ. However, it is unlikely that a 60-item questionnaire can be administered for a quick and easy evaluation. This motivated us to propose a simplified model based on our past research. Between 2005 and 2010, we administered 11 subjective questionnaires to a total of 807 subjects [4,5,6,12,13,14,23,24]. Initial questionnaires covered some of the four categories identified in ResQue. As we conducted more experiments, we became more convinced of the four categories and used all of them in recent studies. On average, between 12 and 15 questions were used. Based on this previous work, we have synthesized and organized a total of 15 questions as a simplified model for the purpose of performing a quick and easy usability and adoption evaluation of a recommender (see the questions marked with a * sign in the appendix).

5. CONCLUSION AND FUTURE WORK
User evaluation of recommender systems is a crucial subject of study that requires a deep understanding, development and testing of the right dimensions (or constructs) and the standardization of the questions used. The framework described in this paper presents the first attempt to develop a complete and balanced evaluation framework that measures users' subjective attitudes based on their experience with a recommender.

ResQue consists of a set of 13 constructs and 60 questions for a high-quality recommender system from the user's point of view and can be used as a standard guideline for a user evaluation. It can also be adapted to a custom-made user evaluation by tailoring it to an individual research context. Researchers and practitioners can use these questionnaires with ease to measure users' general satisfaction with recommenders, their readiness to adopt the technology, and their intention to purchase recommended items and return to the site in the future.

After ResQue was finalized, we asked several expert researchers in the recommender systems community to review the model. Their feedback and comments were then incorporated into the final version of the model. This method, known as the Delphi method, is one of the first validation attempts on the model. Since the work was submitted, we have started conducting a survey to further validate the model's reliability, validity and sensitivity using factor analysis, structural equation modeling (SEM), and other techniques described in [21]. Initial results based on 150 participants indicate how the model can be interpreted and show factors that correspond to the original model. At the same time, the analysis also gives some indications of how to refine the model. More users are expected to participate in the survey and the final outcome will be reported soon.
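Reliability analysis of this kind usually begins with an internal-consistency statistic such as Cronbach's alpha, a standard tool in the psychometric literature [21], computed per construct before factor analysis or SEM. The sketch below shows the textbook formula; the response matrix is invented purely for illustration.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for one construct.

    scores: respondents x items matrix of (already recoded) Likert answers.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
    """
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # per-item sample variance
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Invented example: 5 respondents answering the 3 perceived accuracy items.
responses = np.array([
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
])
print(round(cronbach_alpha(responses), 2))  # -> 0.92; values near or above 0.7 are conventionally deemed acceptable
```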
APPENDIX
A. Constructs and Questions of ResQue
The following contains the questionnaire statements that can be used in a survey. They were developed based on the ResQue model described in this paper. Users should be asked to indicate their answer to each of the questions using a 1-5 Likert scale, where 1 indicates "strongly disagree" and 5 "strongly agree."

A1. Quality of Recommended Items
A.1.1 Accuracy
- The items recommended to me matched my interests.*
- The recommender gave me good suggestions.
- I am not interested in the items recommended to me (reverse scale).

A.1.2 Relative Accuracy
- The recommendation I received better fits my interests than what I may receive from a friend.
- A recommendation from my friends better suits my interests than the recommendation from this system (reverse scale).

A.1.3 Familiarity
- Some of the recommended items are familiar to me.
- I am not familiar with the items that were recommended to me (reverse scale).

A.1.4 Attractiveness
- The items recommended to me are attractive.

A.1.5 Enjoyability
- I enjoyed the items recommended to me.

A.1.6 Novelty
- The items recommended to me are novel and interesting.*
- The recommender system is educational.
- The recommender system helps me discover new products.
- I could not find new items through the recommender (reverse scale).

A.1.7 Diversity
- The items recommended to me are diverse.*
- The items recommended to me are similar to each other (reverse scale).*

A.1.8 Context Compatibility
- I was only provided with general recommendations.
- The items recommended to me took my personal context requirements into consideration.
- The recommendations are timely.

A2. Interaction Adequacy
- The recommender provides an adequate way for me to express my preferences.
- The recommender provides an adequate way for me to revise my preferences.
- The recommender explains why the products are recommended to me.*

A3. Interface Adequacy
- The recommender's interface provides sufficient information.
- The information provided for the recommended items is sufficient for me.
- The labels of the recommender interface are clear and adequate.
- The layout of the recommender interface is attractive and adequate.*
A4. Perceived Ease of Use
A.4.1 Ease of Initial Learning
- I became familiar with the recommender system very quickly.
- I easily found the recommended items.
- Looking for a recommended item required too much effort (reverse scale).

A.4.2 Ease of Preference Elicitation
- I found it easy to tell the system about my preferences.
- It is easy to learn to tell the system what I like.
- It required too much effort to tell the system what I like (reverse scale).

A.4.3 Ease of Preference Revision
- I found it easy to make the system recommend different things to me.
- It is easy to train the system to update my preferences.
- I found it easy to alter the outcome of the recommended items due to my preference changes.
- It is easy for me to inform the system if I dislike/like the recommended item.
- It is easy for me to get a new set of recommendations.

A.4.4 Ease of Decision Making
- Using the recommender to find what I like is easy.
- I was able to take advantage of the recommender very quickly.
- I quickly became productive with the recommender.
- Finding an item to buy with the help of the recommender is easy.*
- Finding an item to buy, even with the help of the recommender, consumes too much time.

A5. Perceived Usefulness
- The recommended items effectively helped me find the ideal product.*
- The recommended items influence my selection of products.
- I feel supported in finding what I like with the help of the recommender.*
- I feel supported in selecting the items to buy with the help of the recommender.

A6. Control/Transparency
- I feel in control of telling the recommender what I want.
- I don't feel in control of telling the system what I want (reverse scale).
- I don't feel in control of specifying and changing my preferences (reverse scale).
- I understood why the items were recommended to me.
- The system helps me understand why the items were recommended to me.
- The system seems to control my decision process rather than me (reverse scale).

A7. Attitudes
- Overall, I am satisfied with the recommender.*
- I am convinced of the products recommended to me.*
- I am confident I will like the items recommended to me.*
- The recommender made me more confident about my selection/decision.
- The recommended items made me confused about my choice (reverse scale).
- The recommender can be trusted.

A8. Behavioral Intentions
A.8.1 Intention to Use the System
- If a recommender such as this exists, I will use it to find products to buy.

A.8.2 Continuance and Frequency
- I will use this recommender again.*
- I will use this type of recommender frequently.
- I prefer to use this type of recommender in the future.

A.8.3 Recommendation to Friends
- I will tell my friends about this recommender.*

A.8.4 Purchase Intention
- I would buy the items recommended, given the opportunity.*

6. REFERENCES
[1] Adomavicius, G. and Tuzhilin, A. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734-749.
[2] Beenen, G., Ling, K., Wang, X., Chang, K., Frankowski, D., Resnick, P., et al. 2004. Using social psychology to motivate contributions to online communities. In CSCW '04: Proceedings of the ACM Conference on Computer Supported Cooperative Work. New York: ACM Press.
[3] Castagnos, S., Jones, N., and Pu, P. 2009. Recommenders' Influence on Buyers' Decision Process. In Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys 2009), 361-364.
[4] Chen, L. and Pu, P. 2006. Trust Building with Explanation Interfaces. In Proceedings of the International Conference on Intelligent User Interfaces (IUI'06), 93-100.
[5] Chen, L. and Pu, P. 2008. A Cross-Cultural User Evaluation of Product Recommender Interfaces. RecSys 2008, 75-82.
[6] Chen, L. and Pu, P. 2009. Interaction Design Guidelines on Critiquing-based Recommender Systems. User Modeling and User-Adapted Interaction (UMUAI) 19(3), 167-206.
[7] Davis, F.D. 1989. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly 13, 319-339.
[8] Grabner-Kräuter, S. and Kaluscha, E.A. 2003. Empirical research in on-line trust: a review and critical assessment. Int. J. Hum.-Comput. Stud. 58(6), 783-812.
[9] Herlocker, J.L., Konstan, J.A., Borchers, A., and Riedl, J. 1999. An algorithmic framework for performing collaborative filtering. In Proc. of ACM SIGIR 1999, ACM Press, 230-237.
[10] Herlocker, J.L., Konstan, J.A., and Riedl, J. 2000. Explaining collaborative filtering recommendations. CSCW 2000, 241-250.
[11] Herlocker, J.L., Konstan, J.A., Terveen, L.G., and Riedl, J. 2004. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1), 5-53.
[12] Hu, R. and Pu, P. 2009. Potential Acceptance Issues of Personality-based Recommender Systems. In Proceedings of the ACM Conference on Recommender Systems (RecSys'09), New York City, NY, USA, October 22-25, 2009.
[13] Jones, N. and Pu, P. 2007. User Technology Adoption Issues in Recommender Systems. In Proceedings of the Networking and Electronic Commerce Research Conference (NAEC 2007), 379-394.
[14] Jones, N., Pu, P., and Chen, L. 2009. How Users Perceive and Appraise Personalized Recommendations. In Proceedings of the User Modeling, Adaptation, and Personalization Conference (UMAP 2009), 461-466.
[15] Kirakowski, J. and Corbett, M. 1993. SUMI: the Software Usability Measurement Inventory. British Journal of Educational Technology 24(3), 210-214.
[16] Lewis, J.R. 1993. IBM computer usability satisfaction questionnaires: psychometric evaluation and instructions for use.
[17] McGinty, L. and Smyth, B. 2003. On the role of diversity in conversational recommender systems. In Proceedings of the Fifth International Conference on Case-Based Reasoning (ICCBR'03), 276-290.
[18] McNee, S.M., Albert, I., Cosley, D., Gopalkrishnan, P., Lam, S.K., Rashid, A.M., Konstan, J.A., and Riedl, J. 2002. On the Recommending of Citations for Research Papers. In Proc. of ACM CSCW 2002, ACM Press, 116-125.
[19] McNee, S.M., Lam, S.K., Konstan, J.A., and Riedl, J. 2003. Interfaces for eliciting new user preferences in recommender systems. User Modeling 2003, 178-187.
[20] McNee, S.M., Riedl, J., and Konstan, J.A. 2006. Being accurate is not enough: How accuracy metrics have hurt recommender systems. CHI Extended Abstracts 2006, 1097-1101.
[21] Nunnally, J.C. 1978. Psychometric Theory.
[22] Ozok, A.A., Fan, Q., and Norcio, A.F. 2010. Design guidelines for effective recommender system interfaces based on a usability criteria conceptual model: results from a college student population. Behaviour & Information Technology 29(1), 57-83.
[23] Pu, P., Chen, L., and Kumar, P. 2008. Evaluating Product Search and Recommender Systems for E-Commerce Environments. Electronic Commerce Research 8(1-2), 1-27.
[24] Pu, P., Zhou, M., and Castagnos, S. 2009. Critiquing Recommenders for Public Taste Products. In Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys 2009), 249-252.
[25] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. 2000. Analysis of recommendation algorithms for e-commerce. ACM Conference on Electronic Commerce, 158-167.
[26] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. 2001. Item-based collaborative filtering recommendation algorithms. WWW 2001, 285-295.
[27] Simonson, I. 2005. Determinants of customers' responses to customized offers: Conceptual framework and research propositions. Journal of Marketing 69 (January 2005), 32-45.
[28] Sinha, R. and Swearingen, K. 2001. Comparing Recommendations made by Online Systems and Friends. In Proceedings of the DELOS-NSF Workshop on Personalization and Recommender Systems in Digital Libraries.
[29] Stohr, E.A. and Viswanathan, S. 1999. Recommendation systems: Decision support for the information economy. In Emerging Information Technologies, K.E. Kendall, Ed. Thousand Oaks, CA: SAGE, 21-44.
[30] Swearingen, K. and Sinha, R. 2002. Interaction design for recommender systems. In Proceedings of Designing Interactive Systems (DIS 2002).
[31] Tintarev, N. and Masthoff, J. 2007. A survey of explanations in recommender systems. ICDE Workshops 2007, 801-810.
[32] Venkatesh, V., Morris, M.G., Davis, G.B., and Davis, F.D. 2003. User acceptance of information technology: Toward a unified view. MIS Quarterly 27(3), 425-478.
[33] Ziegler, C.N., McNee, S.M., Konstan, J.A., and Lausen, G. 2005. Improving Recommendation Lists through Topic Diversification. In Proc. of WWW 2005, ACM Press, 22-32.