Effects of Online Recommendations on Consumers' Willingness to Pay

Gediminas Adomavicius, University of Minnesota, Minneapolis, MN, gedas@umn.edu
Jesse Bockstedt, University of Arizona, Tucson, AZ, bockstedt@email.arizona.edu
Shawn Curley, University of Minnesota, Minneapolis, MN, curley@umn.edu
Jingjing Zhang, Indiana University, Bloomington, IN, jjzhang@indiana.edu

ABSTRACT

We present the results of two controlled behavioral studies on the effects of online recommendations on consumers' economic behavior. In the first study, we found strong evidence that participants' willingness to pay was significantly affected by randomly assigned song recommendations, even when controlling for participants' preferences and demographics. In the second study, we presented participants with actual system-generated recommendations that were intentionally perturbed (i.e., significant error was introduced) and observed similar effects on willingness to pay. The results have significant implications for the design and application of recommender systems as well as for e-commerce practice.

1. INTRODUCTION

Recommender systems have become commonplace in online purchasing environments. Much research in information systems and computer science has focused on algorithmic design and improving recommender systems' performance (see Adomavicius and Tuzhilin 2005 for a review). However, little research has explored the impact of recommender systems on consumer behavior from an economic or decision-making perspective. Considering how important recommender systems have become in helping consumers reduce search costs to make purchase decisions, it is necessary to understand how online recommender systems influence purchases.

In this paper, we investigate the relationship between recommender systems and consumers' economic behavior. Drawing on theory from behavioral economics, judgment and decision-making, and marketing, we hypothesize that online recommendations¹ significantly pull a consumer's willingness to pay in the direction of the recommendation. We test our hypotheses using two controlled behavioral experiments on the recommendation and sale of digital songs. In the first study, we find strong evidence that randomly generated recommendations (i.e., not based on user preferences) significantly impact consumers' willingness to pay, even when we control for user preferences for the song, demographic and consumption-related factors, and individual-level heterogeneity. In the second study, we extend these results and find strong evidence that these effects still exist with real recommendations generated by a live real-time recommender system. The results of the second study demonstrate that errors in recommendations, a common feature of live recommender systems, can significantly impact the economic behavior of consumers toward the recommended products.

¹ In this paper, for ease of exposition, we use the term "recommendations" in a broad sense. Any rating that the consumer receives purportedly from a recommendation system, even if negative (e.g., 1 star on a five-star scale), is termed a recommendation of the system.

Paper presented at the 2012 Decisions@RecSys workshop in conjunction with the 6th ACM Conference on Recommender Systems. Copyright © 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

2. LITERATURE REVIEW AND HYPOTHESES

Behavioral research has indicated that judgments can be constructed upon request and, consequently, are often influenced by elements of the environment. One such influence arises from the use of an anchoring-and-adjustment heuristic (Tversky and Kahneman 1974; see review by Chapman and Johnson 2002), the focus of the current study. Using this heuristic, the decision maker begins with an initial value and adjusts it as needed to arrive at the final judgment. A systematic bias has been observed with this process in that decision makers tend to arrive at a judgment that is skewed toward the initial anchor.

Past studies have largely been performed using tasks for which a verifiable outcome is being judged, leading to a bias measured against an objective performance standard (e.g., see review by Chapman and Johnson 2002). In the recommendation setting, the judgment is a subjective preference and is not verifiable against an objective standard. This aspect of the recommendation setting is one of the task elements illustrated in Figure 1, where accuracy is measured as a comparison between the rating prediction and the consumer's actual rating, a subjective outcome.
Also illustrated in Figure 1 is the feedback system involved in the use of recommender systems. Predicted ratings (recommendations) are systematically tied to the consumer's perceptions of products. Therefore, providing consumers with a predicted "system rating" can potentially introduce anchoring biases that significantly influence their subsequent ratings of items.

Figure 1. Ratings as part of a feedback loop in consumer-recommender interactions. (Diagram: the recommender system, performing consumer preference estimation, sends the consumer predicted ratings expressing recommendations for unknown items; the consumer, engaged in preference formation, purchasing behavior, and consumption, returns actual ratings expressing preferences for consumed items; accuracy is the comparison between predicted and actual ratings.)

One of the few papers identified in the mainstream anchoring literature that has looked directly at anchoring effects in preference construction is that of Schkade and Johnson (1989). However, their work studied preferences between abstract, stylized, simple (two-outcome) lotteries. This preference situation is far removed from the more realistic situation that we address in this work. More similar to our setting, Ariely et al. (2003) observed anchoring in bids provided by students participating in auctions of consumer products (e.g., wine, books, chocolates) in a classroom setting. However, participants were not allowed to sample the goods, an issue we address in this study.

Very little research has explored how the cues provided by recommender systems influence online consumer behavior. Cosley et al. (2003) dealt with a related but significantly different anchoring phenomenon in the context of recommender systems. They explored the effects of system-generated recommendations on user re-ratings of movies. They found that users showed high test-retest consistency when asked to re-rate a movie with no prediction provided.
However, when users were asked to re-rate a movie while being shown a "predicted" rating that was altered upward or downward from their original rating by a single fixed amount of one rating point (providing a high or a low anchor), users tended to give higher or lower ratings, respectively (compared to a control group receiving their accurate original ratings). This showed that anchoring could affect consumers' ratings based on preference recall, for movies seen in the past and now being evaluated.

Adomavicius et al. (2011) looked at a similar effect in an even more controlled setting, in which the consumer preference ratings for items were elicited at the time of item consumption. Even without a delay between consumption and elicited preference, anchoring effects were observed: the predicted ratings, when perturbed to be higher or lower, moved the consumer ratings in the same direction. The effects on consumer ratings are potentially important for a number of reasons, e.g., as identified by Cosley et al. (2003): (1) biases can contaminate the inputs of the recommender system, reducing its effectiveness; (2) biases can artificially improve the resulting accuracy, providing a distorted view of the system's performance; and (3) biases might allow agents to manipulate the system so that it operates in their favor. The direct effect of recommendations on consumer behavior is therefore an important and open research question.

However, in addition to the preference formation and consumption issues, there is also the purchasing decision of the consumer, as shown in Figure 1. Aside from the effects on ratings, there is the important question of possible anchoring effects on economic behavior. Hence, the primary focus of this research is to determine how anchoring effects created by online recommendations impact consumers' economic behavior as measured by their willingness to pay. Based on the prior research, we expect effects on economic behavior similar to those observed for consumer ratings and preferences. Specifically, we first hypothesize that recommendations will significantly impact consumers' economic behavior by pulling their willingness to pay in the direction of the recommendation, regardless of the accuracy of the recommendation.

Hypothesis 1: Participants exposed to randomly generated artificially high (low) recommendations for a product will exhibit a higher (lower) willingness to pay for that product.

A common issue for recommender systems is error (often measured by RMSE) in predicted ratings. This is evidenced by Netflix's recent competition for a better recommendation algorithm with the goal of reducing prediction error by 10% (Bennett and Lanning 2007). If anchoring biases can be generated by recommendations, then the accuracy of recommender systems becomes all the more important. Therefore, we wish to explore the potential anchoring effects introduced when real recommendations (i.e., those based on state-of-the-art recommender systems algorithms) are erroneous. We hypothesize that significant errors in real recommendations can have similar effects on consumers' behavior as captured by their willingness to pay for products.

Hypothesis 2: Participants exposed to a recommendation that contains significant error in an upward (downward) direction will exhibit a higher (lower) willingness to pay for the product.

We test these hypotheses with two controlled behavioral studies, discussed next.
3. STUDY 1: RECOMMENDATIONS AND WILLINGNESS-TO-PAY

Study 1 was designed to test Hypothesis 1 and establish whether or not randomly generated recommendations could significantly impact a consumer's willingness to pay.

3.1. Procedure

Both studies presented in this paper were conducted in the same behavioral research lab at a large public North American university, and participants were recruited from the university's research participant pool. Participants were paid a $10 fee plus a $5 endowment that was used in the experimental procedure (discussed below). Summary statistics on the participant pool for both Study 1 and Study 2 are presented in Table 1. Seven participants were dropped from Study 1 because of response issues, leaving data on 42 participants for analysis.

Table 1. Participant summary statistics.

                                              Study 1               Study 2
  # of Participants (n)                       42                    55
  Average Age (years)                         21.5 (1.95)           22.9 (2.44)
  Gender                                      28 Female, 14 Male    31 Female, 24 Male
  Prior experience with recommender systems   50% (21/42)           47.3% (26/55)
  Student Level                               36 undergrad, 6 grad  27 undergrad, 25 grad, 3 other
  Buy new music at least once a month         66.7% (28/42)         63.6% (35/55)
  Own more than 1000 songs                    50% (21/42)           47.3% (26/55)

The experimental procedure for Study 1 consisted of three main tasks, all of which were conducted in a web-based application using personal computers with headphones and dividers between participants. In the first task, participants were asked to provide ratings for at least 50 popular music songs on a scale from one to five stars with half-star increments. The songs presented for the initial rating task were randomly selected from a pool of 200 popular songs, which was generated by taking the songs ranked in the bottom half of the year-end Billboard 100 charts from 2006 to 2009.² For each song, the artist name(s), song title, duration, album name, and a 30-second sample were provided. The objective of the song-rating task was to capture music preferences from the participants so that recommendations could later be generated using a recommendation algorithm (in Study 2 and in post-hoc analysis of Study 1, as discussed later).

² The Billboard 100 provides a list of popular songs released in each year. The top half of each year's list was not used, to reduce the number of songs in our database that participants would already own.

To capture willingness to pay, we employed the incentive-compatible Becker-DeGroot-Marschak (BDM) method commonly used in experimental economics (Becker et al. 1964). For each song presented during the third task of the study, participants were asked to declare a price they were willing to pay, between zero and 99 cents. Participants were informed that five songs selected at random at the end of the study would be assigned random prices, based on a uniform distribution, between one and 99 cents. For each of these five songs, the participant was required to purchase the song using money from their $5 endowment at the randomly assigned price if that price was equal to or below their declared willingness to pay. Participants were presented with a detailed explanation of the BDM method so that they understood that the procedure incentivizes accurate reporting of their prices, and they were required to take a short quiz on the method and endowment distribution before starting the study.
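To make the BDM settlement rule concrete, the following minimal sketch (ours, for exposition; the function and variable names are illustrative and not taken from the study's software) simulates the procedure described above: five songs are drawn at random, each is assigned a uniform random price between 1 and 99 cents, and a drawn song must be purchased whenever its price does not exceed the declared willingness to pay.

    import random

    def bdm_settlement(declared_wtp, n_drawn=5, endowment=500):
        """Simulate the BDM settlement used in the study (amounts in cents).

        declared_wtp: dict mapping song id -> declared willingness to pay (0-99).
        Returns the forced purchases and the endowment remaining after them.
        """
        purchases = []
        for song in random.sample(list(declared_wtp), n_drawn):
            price = random.randint(1, 99)        # random price, uniform on 1..99
            if price <= declared_wtp[song]:      # must buy at or below declared WTP
                purchases.append((song, price))
                endowment -= price
        return purchases, endowment

    # Example: a participant who declared 50 cents for each of 40 songs.
    declared = {f"song_{i}": 50 for i in range(40)}
    print(bdm_settlement(declared))

Because the price is drawn independently of the declared value, understating one's true valuation only forgoes profitable purchases and overstating it only risks unprofitable ones, which is why the mechanism incentivizes truthful reporting.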
At the conclusion of the study, participants completed a short survey collecting demographic and other individual information for use in the analyses. The participation fee and the endowment, minus fees paid for the required purchases, were distributed to participants in cash. MP3 versions of the songs purchased by participants were "gifted" to them through Amazon.com within approximately 12 hours after the study concluded.

In the second task, a different list of songs was presented (with the same information for each song as in the first task) from the same overall set of 200 songs. For each song, the participant was asked whether or not they owned the song. Songs that were owned were excluded from the third task, in which the willingness-to-pay judgments were obtained. When the participants had identified at least 40 songs that they did not own, the third task was initiated.

In the third main task of Study 1, participants completed a within-subjects experiment in which the treatment was the star rating of the song recommendation and the dependent variable was willingness to pay for the songs. In the experiment, participants were presented with 40 songs that they did not own, each of which included a star rating recommendation, artist name(s), song title, duration, album name, and a 30-second sample. Ten of the 40 songs were presented with a randomly generated low recommendation between one and two stars (drawn from a uniform distribution; all recommendations were presented with one-decimal-place precision, e.g., 1.3 stars), ten were presented with a randomly generated high recommendation between four and five stars, ten were presented with a randomly generated mid-range recommendation between 2.5 and 3.5 stars, and ten were presented with no recommendation to act as a control. The 30 songs presented with recommendations were randomly ordered, and the 10 control songs were presented last.
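The treatment assignment just described can be summarized in a short sketch. This is a hypothetical reconstruction under our reading of the procedure, not the study's actual code:

    import random

    def assign_study1_stimuli(songs):
        """Assign 40 unowned songs to the Study 1 conditions.

        Ten songs each receive a random low (1.0-2.0), mid (2.5-3.5), or
        high (4.0-5.0) star recommendation, rounded to one decimal place;
        the remaining ten serve as controls with no recommendation and
        are presented last, after the 30 treated songs in random order.
        """
        assert len(songs) == 40
        random.shuffle(songs)
        ranges = [(1.0, 2.0), (2.5, 3.5), (4.0, 5.0)]
        treated = [(song, round(random.uniform(lo, hi), 1))
                   for g, (lo, hi) in enumerate(ranges)
                   for song in songs[g * 10:(g + 1) * 10]]
        random.shuffle(treated)
        return treated + [(song, None) for song in songs[30:]]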
3.2. Analysis and Results

We start by presenting a plot of the aggregate means of willingness to pay for each of the treatment groups in Figure 2. Note that, although there were three treatment groups, the actual ratings shown to the participants were randomly assigned star ratings from within the corresponding treatment group range (low: 1.0-2.0 stars, mid: 2.5-3.5 stars, high: 4.0-5.0 stars).

Figure 2. Study 1 treatment means. (Plot of mean willingness to pay by treatment group; not reproduced here.)

As an initial analysis, we performed a repeated measures ANOVA, as shown in Table 2, demonstrating a statistically significant effect of the shown rating on willingness to pay. Since the overall treatment effect was significant, we followed with pair-wise contrasts using t-tests across treatment levels and against the control group, as shown in Table 3. All three treatment conditions differed significantly, showing a clear, positive effect of the treatment on economic behavior.

Table 2. Study 1 repeated measures ANOVA.

                    Partial Sum of Squares   Degrees of Freedom   Mean Square   F Statistic   p-value
  Participant       396744.78                41                   9676.70
  Treatment Level   24469.41                 2                    12234.70      42.27         <0.001
  Residual          346142.41                1196                 289.42
  Total             762747.40                1239                 615.62

Table 3. Comparison of aggregate treatment group means with t-tests.

                       Control    Low        Mid
  Low (1-2 Star)       4.436***
  Mid (2.5-3.5 Star)   0.555      4.075***
  High (4-5 Star)      1.138      5.501***   1.723**

  * p<0.1, ** p<0.05, *** p<0.01. Two-tailed t-test for Control vs. Mid; all others one-tailed.

To provide additional depth for our analysis, we used a panel data regression model to explore the relationship between the shown star rating (a continuous variable) and willingness to pay, while controlling for participant-level factors. A Hausman test was conducted, and a random effects model was deemed appropriate, which also allowed us to account for participant-level covariates in the analysis. The dependent variable, willingness to pay, was measured on an integer scale between 0 and 99 and was skewed toward the lower end of the scale. This is representative of typical count data; therefore, a Poisson regression was used (overdispersion was not an issue). The main independent variable was the shown star rating of the recommendation, which was continuous between one and five stars. Control variables for several demographic and consumer-related factors were included, which were captured in the survey at the end of the study. Additionally, we controlled for the participants' preferences by calculating an actual predicted star rating recommendation for each song (on a 5-star scale with one-decimal precision), post hoc, using the popular and widely used item-based collaborative filtering (IBCF) algorithm (Sarwar et al. 2001).³ By including this predicted rating (which was not shown to the participant during the study) in the analysis, we are able to determine whether the random recommendations had an impact on willingness to pay above and beyond the participant's predicted preferences.

³ Several recommendation algorithms were evaluated based on the Study 1 training data, and IBCF was selected for use in both studies because it had the highest predictive accuracy.
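For readers unfamiliar with IBCF, the sketch below illustrates the core prediction step in the spirit of Sarwar et al. (2001): an unseen item is scored as a similarity-weighted average of the user's ratings on the most similar co-rated items. It is a simplified illustration (plain cosine similarity, no normalization refinements), not the exact implementation used in our studies.

    import numpy as np

    def ibcf_predict(R, user, item, k=20):
        """Predict R[user, item] by item-based collaborative filtering.

        R: (n_users x n_items) array of star ratings, 0 where unrated.
        Computes cosine similarity over co-rating users, then a weighted
        average of the target user's ratings on the k most similar items.
        """
        candidates = []
        for j in np.flatnonzero(R[user]):            # items this user rated
            a, b = R[:, item], R[:, j]
            mask = (a > 0) & (b > 0)                 # users who rated both items
            if not mask.any():
                continue
            sim = np.dot(a[mask], b[mask]) / (
                np.linalg.norm(a[mask]) * np.linalg.norm(b[mask]))
            candidates.append((sim, R[user, j]))
        top = sorted(candidates, reverse=True)[:k]
        if not top:
            return None
        sims = np.array([s for s, _ in top])
        ratings = np.array([r for _, r in top])
        return float(np.round(np.dot(sims, ratings) / sims.sum(), 1))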
The resulting Poisson regression model is shown below, where WTP_ij is the reported willingness to pay for participant i on song j, ShownRating_ij is the recommendation star rating shown to participant i for song j, PredictedRating_ij is the predicted recommendation star rating for participant i on song j, and Controls_i is a vector of demographic and consumer-related variables for participant i. The controls included in the model were gender (binary), age (integer), school level (undergrad yes/no binary), prior experience with recommendation systems (yes/no binary), preference ratings for the music genres country, rock, hip hop, and pop (each an interval five-point scale), the number of songs owned (interval five-point scale), frequency of music purchases (interval five-point scale), whether they thought the recommendations in the study were accurate (interval five-point scale), and whether they thought the recommendations were useful (interval five-point scale). The composite error term (u_i + ε_ij) includes the individual participant effect u_i and the standard disturbance term ε_ij.

    log(WTP_ij) = b0 + b1(ShownRating_ij) + b2(PredictedRating_ij) + b3(Controls_i) + u_i + ε_ij

The results of the regression are shown in Table 4. Note that the control observations were not included, since they had null values for the main independent variable, ShownRating. The results of our analysis for Study 1 provide strong support for Hypothesis 1 and demonstrate clearly that there is a significant effect of recommendations on consumers' economic behavior. Specifically, we have shown that even randomly generated recommendations with no basis in user preferences can impact consumers' perceptions of a product and, thus, their willingness to pay. The regression analysis goes further and controls for participant-level factors and, most importantly, the participant's predicted preferences for the product being recommended. As can be seen in Table 4, after controlling for all these factors, a one-unit change in the shown rating results in a 0.168 change (in the same direction) in the log of the expected willingness to pay (in cents). As an example, assuming a consumer has a willingness to pay of $0.50 for a specific song and is given a recommendation, increasing the recommendation star rating by one star would increase the consumer's willingness to pay to $0.59.

Table 4. Study 1 regression results.

  Dependent Variable: log(Willingness to Pay)

  Variable             Coefficient   Std. Error
  ShownRating          0.168***      0.004
  PredictedRating      0.323***      0.015
  Controls:
    male               -0.636**      0.289
    undergrad          -0.142        0.642
    age                -0.105        0.119
    usedRecSys         -0.836**      0.319
    country            0.103         0.108
    rock               0.125         0.157
    hiphop             0.152         0.132
    pop                0.157         0.156
    recomUseful        -0.374        0.255
    recomAccurate      0.414*        0.217
    buyingFreq         -0.180        0.175
    constant           4.437         3.414
    songsOwned         -0.407*       0.223
  Number of Obs.              1240
  Number of Participants      42
  Log-likelihood              -9983.3312
  Wald Chi-Square Statistic   1566.34 (p-value 0.0000)
  * p<0.1, ** p<0.05, *** p<0.01
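Since the Poisson model uses a log link, a coefficient b translates into a multiplicative factor of exp(b) on the expected willingness to pay per unit increase in the regressor. The quick check below (ours, for exposition) reproduces the $0.50 to $0.59 example based on Table 4.

    import math

    b_shown = 0.168                           # ShownRating coefficient, Table 4
    baseline_wtp = 50                         # baseline willingness to pay, cents
    print(baseline_wtp * math.exp(b_shown))   # ~59.1 cents, i.e., about $0.59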
4. STUDY 2: ERRORS IN RECOMMENDATIONS

The goal of Study 2 was to extend the results of Study 1 by testing Hypothesis 2 and exploring the impact of significant error in true recommendations on consumers' willingness to pay. As discussed below, the design of this study is intended to test for effects similar to those in Study 1, but in a more realistic setting with recommender-system-generated, real-time recommendations.

4.1. Procedure

Participants in Study 2 used the same facilities and were recruited from the same pool as in Study 1; however, there was no overlap in participants across the two studies. The same participation fee and endowment used in Study 1 were provided to participants in Study 2. Fifteen participants were removed from the analysis in Study 2 because of issues in their responses, leaving data on 55 participants for analysis.

Study 2 was also a within-subjects design, with perturbation of the recommendation star rating as the treatment and willingness to pay as the dependent variable. The main tasks for Study 2 were virtually identical to those in Study 1. The only differences between the studies were the treatments and the process for assigning stimuli to the participants in the recommendation task. In Study 2, all participants completed the initial song-rating and song-ownership tasks as in Study 1. Next, real song recommendations were calculated based on the participants' preferences, and these were then perturbed (i.e., excess error was introduced to each recommendation) to generate the shown recommendation ratings. In other words, unlike Study 1, in which random recommendations were presented to participants, in Study 2 participants were presented with perturbed versions of their actual personalized recommendations. Perturbations of -1.5 stars, -1 star, -0.5 stars, 0 stars, +0.5 stars, +1 star, and +1.5 stars were added to the actual recommendations, representing seven treatment levels. The perturbed recommendation shown to the participant was constrained to be between one and five stars; therefore, perturbations were pseudo-randomly assigned to ensure that the sum of the actual recommendation and the perturbation would fit within the allowed rating scale. The recommendations were calculated using the item-based collaborative filtering (IBCF) algorithm (Sarwar et al. 2001), and the ratings data from Study 1 were used as training data.

Each participant was presented with 35 perturbed, personalized song recommendations, five from each of the seven treatment levels. The song recommendations were presented in a random order. Participants were asked to provide their willingness to pay for each song, which was captured using the same BDM technique as in Study 1. The final survey, payouts, and song distribution were also conducted in the same manner as in Study 1.
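A minimal sketch of the perturbation assignment just described is shown below. This is an illustrative reconstruction, not the study's actual code: in particular, the rejection-sampling loop is our assumption about one simple way to satisfy the scale constraint while using each level exactly five times.

    import random

    LEVELS = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

    def assign_perturbations(predicted):
        """Pair 35 predicted star ratings with perturbation levels.

        predicted: the participant's 35 IBCF-predicted ratings.
        Each of the seven levels is used exactly five times, and the
        shown rating (predicted + perturbation) must stay in [1.0, 5.0].
        """
        assert len(predicted) == 35
        pool = LEVELS * 5
        for _ in range(100_000):                 # rejection sampling
            random.shuffle(pool)
            shown = [p + d for p, d in zip(predicted, pool)]
            if all(1.0 <= s <= 5.0 for s in shown):
                return list(zip(predicted, pool, shown))
        raise ValueError("no feasible assignment found for these predictions")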
4.2. Analysis and Results

For Study 2, we focus on the regression analysis to determine the relationship between error in a recommendation and willingness to pay. We follow a similar approach as in Study 1 and model this relationship using a Poisson random effects regression model. The distribution of the willingness-to-pay data in Study 2 was similar to that of Study 1, overdispersion was not an issue, and the results of a Hausman test for fixed versus random effects suggested that a random effects model was appropriate. We control for the participants' preferences using the predicted rating for each song in the study (i.e., the recommendation rating prior to perturbation), which was calculated using the IBCF algorithm. Furthermore, the same set of control variables used in Study 1 was included in our regression model for Study 2. The resulting regression model is presented below, where the main difference from the model used in Study 1 is the inclusion of Perturbation_ij (i.e., the error introduced for the recommendation of song j to participant i) as the main independent variable. The results are presented in Table 5.

    log(WTP_ij) = b0 + b1(Perturbation_ij) + b2(PredictedRating_ij) + b3(Controls_i) + u_i + ε_ij

Table 5. Study 2 regression results.

  Dependent Variable: log(Willingness to Pay)

  Variable             Coefficient   Std. Error
  Perturbation         0.115***      0.005
  PredictedRating      0.483***      0.012
  Controls:
    male               -0.045        0.254
    undergrad          -0.092        0.293
    age                -0.002        0.053
    usedRecSys         0.379         0.253
    country            -0.056        0.129
    rock               -0.132        0.112
    hiphop             0.0137        0.108
    pop                -0.035        0.124
    recomUseful        0.203*        0.112
    recomAccurate      0.060         0.161
    buyingFreq         0.276**       0.128
    songsOwned         -0.036        0.156
    constant           0.548         1.623
  Number of Obs.              1925
  Number of Participants      55
  Log-likelihood              -16630.547
  Wald Chi-Square Statistic   2374.72 (p-value 0.0000)
  * p<0.1, ** p<0.05, *** p<0.01

The results of Study 2 provide strong support for Hypothesis 2 and extend the results of Study 1 in two important ways. First, Study 2 provides more realism to the analysis, since it utilizes real recommendations generated using an actual real-time recommender system. Second, rather than randomly assigning recommendations as in Study 1, in Study 2 the recommendations presented to participants were calculated based on their preferences and then perturbed to introduce realistic levels of system error. Thus, considering the fact that all recommender systems have some level of error in their recommendations, Study 2 contributes by demonstrating the potential impact of these errors. As seen in Table 5, while controlling for preferences and other factors, a one-unit perturbation in the actual rating results in a 0.115 change in the log of the participant's willingness to pay. As an example, assuming a consumer has a willingness to pay of $0.50 for a given song, perturbing the system's recommendation positively by one star would increase the consumer's willingness to pay to $0.56.

5. CONCLUSIONS

Study 1 provided strong evidence, through a randomized trial design, that willingness to pay can be affected by online recommendations. Participants presented with random recommendations were influenced even when controlling for demographic factors and general preferences. Study 2 extended these results to demonstrate that the same effects exist for real recommendations that contain errors, calculated using the state-of-the-art recommendation algorithms used in practice.

There are significant implications of the results presented. First, the results raise new issues for the design of recommender systems. If recommender systems can generate biases in consumer decision-making, do the algorithms need to be adjusted to compensate for such biases? Furthermore, since recommender systems use a feedback loop based on consumer purchase decisions, do recommender systems need to be calibrated to handle biased input? Second, biases in decision-making based on online recommendations can potentially be used to the advantage of e-commerce companies, and retailers can potentially become more strategic in their use of recommender systems as a means of increasing profit and marketing to consumers. Third, consumers may need to become more cognizant of the potential decision-making biases introduced through online recommendations. Just as savvy consumers understand the impacts of advertising, discounting, and pricing strategies, they may also need to consider the potential impact of recommendations on their purchasing decisions.

6. ACKNOWLEDGMENT

This work is supported in part by National Science Foundation grant IIS-0546443.

REFERENCES

[1] Adomavicius, G., Bockstedt, J., Curley, S., and Zhang, J. 2011. Recommender systems, consumer preferences, and anchoring effects. Proceedings of the RecSys 2011 Workshop on Human Decision Making in Recommender Systems (Decisions@RecSys 2011), Chicago, IL, October 27, pp. 35-42.

[2] Adomavicius, G. and Tuzhilin, A. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), pp. 734-749.

[3] Ariely, D., Loewenstein, G., and Prelec, D. 2003. "Coherent arbitrariness": Stable demand curves without stable preferences. Quarterly Journal of Economics, 118, pp. 73-105.

[4] Becker, G.M., DeGroot, M.H., and Marschak, J. 1964. Measuring utility by a single-response sequential method. Behavioral Science, 9(3), pp. 226-232.

[5] Bennett, J. and Lanning, S. 2007. The Netflix Prize. KDD Cup and Workshop. [www.netflixprize.com].

[6] Chapman, G. and Johnson, E. 2002. Incorporating the irrelevant: Anchors in judgments of belief and value. In Heuristics and Biases: The Psychology of Intuitive Judgment, T. Gilovich, D. Griffin, and D. Kahneman (eds.), Cambridge University Press, Cambridge, pp. 120-138.

[7] Cosley, D., Lam, S., Albert, I., Konstan, J.A., and Riedl, J. 2003. Is seeing believing? How recommender interfaces affect users' opinions. Proceedings of the CHI 2003 Conference, Fort Lauderdale, FL.

[8] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. 2001. Item-based collaborative filtering recommendation algorithms. Proceedings of the 10th International World Wide Web Conference (WWW10), May 1-5, Hong Kong.

[9] Schkade, D.A. and Johnson, E.J. 1989. Cognitive processes in preference reversals. Organizational Behavior and Human Decision Processes, 44, pp. 203-231.

[10] Tversky, A. and Kahneman, D. 1974. Judgment under uncertainty: Heuristics and biases. Science, 185, pp. 1124-1131.