Recommender Systems, Consumer Preferences, and Anchoring Effects

Gediminas Adomavicius, University of Minnesota, Minneapolis, MN, gedas@umn.edu
Jesse Bockstedt, George Mason University, Fairfax, VA, jbockste@gmu.edu
Shawn Curley, University of Minnesota, Minneapolis, MN, curley@umn.edu
Jingjing Zhang, University of Minnesota, Minneapolis, MN, jingjing@umn.edu

ABSTRACT
Recommender systems are becoming a salient part of many e-commerce websites. Much research has focused on advancing recommendation technologies to improve the accuracy of predictions, while the behavioral aspects of using recommender systems are often overlooked. In this study, we explore how consumer preferences at the time of consumption are impacted by predictions generated by recommender systems. We conducted three controlled laboratory experiments to explore the effects of system recommendations on preferences. Studies 1 and 2 investigated user preferences for television programs, which were surveyed immediately following program viewing. Study 3 broadened the investigation to an additional context: preferences for jokes. The results provide strong evidence that viewers' preferences are malleable and can be significantly influenced by ratings provided by recommender systems. Additionally, the effects of pure number-based anchoring can be separated from the effects of the perceived reliability of a recommender system. Finally, the effect of anchoring is roughly continuous, operating over a range of perturbations of the system's predictions.

Copyright is held by the author/owner(s). Decisions@RecSys'11, October 27, 2011, Chicago, IL, USA.

1. INTRODUCTION
Recommender systems have become important decision aids in the electronic marketplace and an integral part of the business models of many firms. Such systems provide suggestions to consumers of products in which they may be interested and allow firms to leverage the power of collaborative filtering and feature-based recommendations to better serve their customers and increase sales. In practice, recommendations significantly impact the decision-making process of many online consumers; for example, it has been reported that a recommender system can account for 10-30% of an online retailer's sales [25] and that roughly two-thirds of the movies rented on Netflix were ones that users might never have considered had they not been recommended by the recommender system [10].

Research in the area of recommender systems has focused almost exclusively on the development and improvement of the algorithms that allow these systems to make accurate recommendations and predictions. Less well studied are the behavioral aspects of using recommender systems in the electronic marketplace.

Many recommender systems ask consumers to rate items that they have previously experienced or consumed. These ratings are then used as inputs by recommender systems, which employ various computational techniques (based on methodologies from statistics, data mining, or machine learning) to estimate consumer preferences for other items, i.e., items that have not yet been consumed by a particular individual. These estimated preferences are often presented to consumers in the form of "system ratings," which indicate an expectation of how much the consumer will like an item based on the recommender system's algorithm and, essentially, serve as recommendations. The subsequent consumer ratings serve as additional inputs to the system, completing a feedback loop that is central to a recommender system's use and value, as illustrated in Figure 1. The figure also illustrates how consumer ratings are commonly used to evaluate the recommender system's performance in terms of accuracy, by comparing how closely the system-predicted ratings match the actual ratings later submitted by the users. In our studies, we focus on the feed-forward influence of the recommender system upon the consumer ratings that, in turn, serve as inputs to these same systems. We believe that providing consumers with a prior rating generated by the recommender system can introduce anchoring biases and significantly influence consumer preferences and, thus, their subsequent rating of an item. As noted by [7], biases in the ratings provided by users can lead to three potential problems: (i) biases can contaminate the inputs of the recommender system, reducing its effectiveness; (ii) biases can artificially improve the resulting accuracy, providing a distorted view of the system's performance; and (iii) biases might allow agents to manipulate the system so that it operates in their favor.

[Figure 1. Ratings as part of a feedback loop in consumer-recommender interactions. The recommender system (consumer preference estimation) presents predicted ratings, expressing recommendations for unknown items, to the consumer (item consumption); the consumer's actual ratings, expressing preferences for consumed items, feed back into the system and are compared against the predicted ratings to assess accuracy.]
For algorithm developers, the issue of biased ratings has been largely ignored. A common underlying assumption in the vast majority of the recommender systems literature is that consumers have preferences for products and services that are developed independently of the recommendation system. However, researchers in behavioral decision making, behavioral economics, and applied psychology have found that people's preferences are often influenced by elements of the environment in which the preferences are constructed [5,6,18,20,30]. This suggests that the common assumption that consumers have true, non-malleable preferences for items is questionable, which raises the following question: whether, and to what extent, is the performance of recommender systems reflective of the process by which preferences are elicited? In this study, our main objective is to answer this question and understand the influence of a recommender system's predicted ratings on consumers' preferences. In particular, we explore four issues related to the impact of recommender systems. (1) The anchoring issue: understanding any potential anchoring effect, particularly at the point of consumption, is the principal goal of this study. Are people's preference ratings for items they just consumed drawn toward predictions that are given to them? (2) The timing issue: does it matter whether the system's prediction is presented before or after the user's consumption of the item? This issue relates to one possible explanation for an anchoring effect. Showing the prediction prior to consumption could provide a prime that influences the user's consumption experience and his or her subsequent rating of the consumed item. If this explanation is operative, an anchoring effect would be expected to be lessened when the recommendation is provided after consumption. (3) The system reliability issue: does it matter whether the system is characterized as more or less reliable? Like the timing issue, this issue is directed at illuminating the nature of the anchoring effect, if obtained. If the system's reliability impacts anchoring, this would provide evidence against the thesis that anchoring in recommender systems is a purely numeric effect of users applying numbers to their experience. (4) The generalizability issue: does the anchoring effect extend beyond a single context? We investigate two different contexts in the paper. Studies 1 and 2 observe ratings of TV shows in a between-subjects design, and Study 3 addresses anchoring for ratings of jokes using a within-subjects design. Consistency of our findings supports a more general phenomenon that affects preference ratings immediately following consumption, when recommendations are provided.
2. BACKGROUND AND HYPOTHESES
Behavioral research has indicated that judgments can be constructed upon request and, consequently, are often influenced by elements of the environment in which this construction occurs. One such influence arises from the use of an anchoring-and-adjustment heuristic [6,30], the focus of the current study. Using this heuristic, the decision maker begins with an initial value and adjusts it as needed to arrive at the final judgment. A systematic bias has been observed with this process, in that decision makers tend to arrive at a judgment that is skewed toward the initial anchor. Prior research on anchoring effects spans three decades and represents a very important aspect of the decision making, behavioral economics, and marketing literatures. Epley and Gilovich [9] identified three waves of research on anchoring: (1) research that establishes anchoring and adjustment as leading to biases in judgment [5,9,21,29,30]; (2) research that develops psychological explanations for anchoring effects [5,9,13,21,23]; and (3) research that unbinds anchoring from its typical experimental setting and "considers anchoring in all of its everyday variety and examines its various moderators in these diverse contexts" ([9], p. 21) [14,17]. Our study is primarily located within the latter wave while also informing the second wave of testing explanations; specifically, our paper provides a contribution both (a) to the study of anchoring in a preference situation at the time of consumption and (b) to the context of recommender systems.

Regarding the former of these contextual features, the effect of anchoring on preference construction is an important open issue. Past studies have largely been performed using tasks for which a verifiable outcome is being judged, leading to a bias measured against an objective performance standard (see also the review by [6]). In the recommendation setting, the judgment is a subjective preference and is not verifiable against an objective standard. The application of previous findings to the preference context is therefore not a straightforward generalization.

Regarding our studies' second contextual feature, very little research has explored how the cues provided by recommender systems influence online consumer behavior. The work that comes closest to ours is [7], which explored the effects of system-generated recommendations on user re-ratings of movies. It found that users showed high test-retest consistency when asked to re-rate a movie with no prediction provided. However, when users were asked to re-rate a movie while being shown a "predicted" rating that was altered upward or downward from their original rating for the movie by a single fixed amount (1 rating point), they tended to give higher or lower ratings, respectively.
Although [7] did involve recommender systems and preferences, our study differs from theirs in important ways. First, we address a fuller range of possible perturbations of the predicted ratings. This allows us to more fully explore the anchoring issue, as to whether any effect is obtained in a discrete fashion or more continuously over the range of possible perturbations. More fundamentally, the focus of [7] was on the effects of anchors on a recall task: users had already "consumed" (or experienced) the movies they were asked to re-rate in the study, had done so prior to entering the study, and were asked to remember how well they liked these movies from their past experiences. Thus, anchoring effects were moderated by potential recall-related phenomena, and preferences were being remembered instead of constructed. In contrast, our work focuses on anchoring effects that occur in the construction of preferences at the time of actual consumption. In our study, no recall is involved in the task impacted by anchors; participants consume the good for the first time in our controlled environment, and we measure the immediate effects of anchoring.

Still, [7] provides a useful model for the design of our studies, with two motivations in mind. First, their design provides an excellent methodology for exploring the effects of recommender systems on preferences. Second, we build upon their findings to determine whether the anchoring effects of recommender systems extend beyond recall-related tasks and impact actual preference construction at the time of consumption. Grounded in the explanations for anchoring discussed above, our research goes beyond their findings to see whether recommender system anchoring effects are strong enough to manipulate a consumer's perceptions of a consumption experience as it is happening.

Since anchoring has been observed in other settings, though settings different from the current preference setting, we begin with the conjecture that the rating provided by a recommender system serves as an anchor. Insufficient adjustment away from the anchor is expected to lead to a subsequent consumer preference rating that is shifted toward the system's predicted rating. This is captured in the following primary hypothesis of the studies:

Anchoring Hypothesis: Users receiving a recommendation biased to be higher will provide higher ratings than users receiving a recommendation biased to be lower.

One mechanism that may underlie an anchoring effect with recommendations is priming, whereby the anchor can serve as a prime or prompt that activates information similar to the anchor, particularly when uncertainty is present [6]. If this dynamic operates in the current setting, then receiving the recommendation prior to consumption, when uncertainty is higher and priming can more easily operate, should lead to greater anchoring effects than receiving the recommendation after consumption. Manipulating the timing of the recommendation therefore provides evidence for tying any effects to priming as an underlying mechanism.

Timing Hypothesis: Users receiving a recommendation prior to consumption will provide ratings that are closer to the recommendation (i.e., will be more affected by the anchor) than users receiving a recommendation after viewing.
Another explanation proposed for the anchoring effect is a content-based explanation, in which the user perceives the anchor as providing evidence as to a correct answer in situations where an objective standard exists. When applied to the use of recommender systems and preferences, this explanation might surface as an issue of the consumer's trust in the system. A prior study found that increasing cognitive trust and emotional trust improved consumers' intentions to accept recommendations [15]. Research also has highlighted the potential role of human-computer interaction and system interface design in achieving high consumer trust and acceptance of recommendations [7,19,22,28]. However, the focus of these studies differs from that underlying our research questions. In particular, the aforementioned prior studies focused on interface design (including presentation of items, explanation facilities, and rating scale definitions) rather than on the anchoring effect of recommendations on the construction of consumer preferences. Our work was motivated in part by these studies to specifically highlight the role of anchoring on users' preference ratings.

In their initial studies, Tversky and Kahneman [30] used anchors that were, explicitly to the subjects, determined by spinning a wheel of fortune. They still observed an effect of the magnitude of the value from this random spin upon the judgments made (for various almanac-type quantities, e.g., the number of African countries in the United Nations). [27] also demonstrated anchoring effects even with extreme values (e.g., anchors of 1215 or 1992 in estimating the year that Einstein first visited the United States). These studies suggest that the anchoring effect may be purely a numerical priming phenomenon and that the quality of the anchor may be less important.
In contrast, [20] found that the anchoring effect was mediated by the plausibility of the anchor. The research cited earlier connecting cognitive trust in recommendation agents to users' intentions to adopt them [15] also suggests a connection between reliability and use. To the extent that the phenomenon is purely numerically driven, weakening the recommendation should have little or no effect. To the extent that issues of trust and quality are of concern, a weakening of the anchoring effect should be observed with a weakening of the perceived quality of the recommending system.

Perceived System Reliability Hypothesis: Users receiving a recommendation from a system that is perceived as more reliable will provide ratings closer to the recommendation (i.e., will be more affected by the anchor) than users receiving a recommendation from a less reliable system.

To explore our hypotheses, we conducted three controlled laboratory experiments in which the system predictions presented to participants were biased upward and downward, so that our hypotheses could be tested in realistic settings. The first study explores our hypotheses by presenting participants with randomly assigned artificial system recommendations. The second study extends the first and uses a live, real-time recommender system to produce predicted ratings for our participants, which are then biased upward or downward. The final study generalizes to preferences among jokes, studied using a within-subjects design and varying levels of rating bias. The next three sections provide details about our experiments and findings.

3. STUDY 1: IMPACT OF ARTIFICIAL RECOMMENDATIONS
The goals of Study 1 were fivefold: (1) to perform a test of the primary conjecture of anchoring effects (i.e., the Anchoring Hypothesis) using artificial anchors; (2) to perform exploratory analyses of whether participants behave differently with high vs. low anchors; (3) to test the Timing Hypothesis for anchoring effects with system recommendations (i.e., concerning differential effects of receiving the recommendation either before or after consuming the item to be subsequently rated); (4) to test the Perceived System Reliability Hypothesis for anchoring effects with system recommendations (i.e., concerning the relationship between the perceived reliability of the recommender system and the anchoring effects of its recommendations); and (5) to build a database of user preferences for television shows, which would be used in computing personalized recommendations for Study 2.

3.1. Methods
216 people completed the study. Ten respondents indicated having seen some portion of the show that was used in the study (all subjects saw the same TV show episode in Study 1). Excluding these, to obtain a more homogeneous sample of subjects all seeing the show for the first time, left 206 subjects for analysis. Participants were solicited from a paid subject pool and were paid a fixed fee at the end of the study.

In Study 1, subjects received artificial anchors; i.e., the system ratings were not produced by a recommender system. All subjects were shown the same TV show episode during the study and were asked to provide their rating of the show after viewing. Participants were randomly assigned to one of seven experimental groups. Before providing their rating, those in the treatment groups received an artificial system rating for the TV show used in the study. Three factors were manipulated in the rating provision. First, the system rating was set to have either a low (1.5, on a scale of 1 through 5) or a high (4.5) value. Since [29] found an asymmetry of the anchoring effect, such that high anchors produced a larger effect than did low anchors in their study of job performance ratings, we used anchors at both ends of the scale.

The second factor in Study 1 was the timing of the recommendation. The artificial system rating was given either before or after the show was watched (but always before the viewer was asked to rate the show). This factor provides a test of the Timing Hypothesis. Together, the first two factors form a 2 x 2 (High/Low anchor x Before/After viewing) between-subjects design (the top four cells of the design in Table 1).

Intersecting with this design is a third factor: the perceived reliability of the system (Strong or Weak) making the recommendation. In the Strong conditions for this factor, subjects were told (wording is for the Before viewing/Low anchor condition): "Our recommender system thinks that you would rate the show you are about to see as 1.5 out of 5." Participants in the corresponding Weak conditions for the perceived reliability factor saw: "We are testing a recommender system that is in its early stages of development. Tentatively, this system thinks that you would rate the show you are about to see as 1.5 out of 5." This factor provides a test of the Perceived System Reliability Hypothesis. At issue is whether any anchoring effect of a recommendation is merely a numerical phenomenon or is tied to the perceived reliability and quality of the recommendation.

Since there was no basis for hypothesizing an interaction between the timing of the recommendation and the strength of the system, the complete factorial design of the three factors was not employed. For parsimony of design, the third factor was manipulated only within the Before conditions, for which the system recommendation preceded the viewing of the TV show.
Thus, within the Before conditions of the Timing factor, the factors of Anchoring (High/Low) and Reliability of the anchor (Strong/Weak) form a 2 x 2 between-subjects design (the bottom four cells of the design in Table 1).

In addition to the six treatment groups, a control condition, in which no system recommendation was provided, was also included. The resulting seven experimental groups, and the sample sizes for each group, are shown in Table 1.

Table 1. Experimental Design and Sample Sizes in Study 1 (Control group: N = 29).
Reliability condition   Timing condition   Low (anchor)   High (anchor)
Strong                  After              29             28
Strong                  Before             29             31
Weak                    Before             29             31

Subjects participated in the study using a web-based interface in a behavioral lab, which provided privacy for individuals participating together. Following a welcome screen, subjects were shown a list of 105 popular, recent TV shows. The TV shows were listed alphabetically within five genre categories: Comedy, Drama, Mystery/Suspense, Reality, and Sci Fi/Fantasy. For each show, subjects indicated whether they had ever seen the show (multiple episodes, one episode, just a part of an episode, or never) and then rated their familiarity with the show on a 7-point Likert scale ranging from "Not at all familiar" to "Very familiar." Based on these responses, the next screen first listed all the shows that the subject indicated having seen and, below that, the shows they had not seen but for which there was some familiarity (rating of 2 or above). Subjects rated each of these shows using a 5-star scale with verbal labels parallel to those used by Netflix.com. Half-star ratings were also allowed, so subjects had a 9-point scale for expressing preference. In addition, for each show, an option of "Not able to rate" was provided. Note that these ratings were not used to produce the artificial system recommendations in Study 1; instead, they were collected to create a database for the recommender system used in Study 2 (described later).

Following the rating task, subjects watched a TV episode. All subjects saw the same episode of a situation comedy. A less well-known TV show was chosen to maximize the likelihood that the majority of subjects were not familiar with it. The episode was streamed from Hulu.com and was 23 minutes 36 seconds in duration. The display screen containing the episode player had a visible time counter counting down from 20 minutes, forcing the respondents to watch the video for at least this time before the button to proceed to the next screen was enabled.

Either immediately preceding (in the Before conditions) or immediately following (in the After conditions) the viewing display, subjects saw a screen providing the system recommendation with the wording appropriate to their condition (Strong/Weak, Low/High anchor). This screen was omitted in the Control condition. Afterwards, subjects rated the episode just viewed. The same 5-star (9-point) rating scale used earlier was provided for the preference rating, except that the "Not able to rate" option was omitted. Finally, subjects completed a short survey that included questions on demographic information and TV viewing patterns.

3.2. Results
All statistical analyses were performed using SPSS 17.0. Table 2 shows the mean ratings for the viewed episode for the seven experimental groups. Our preliminary analyses included data collected by survey, including both demographic data (e.g., gender, age, occupation) and questionnaire responses (e.g., hours watching TV per week, general attitude towards recommender systems), as covariates and random factors. However, none of these variables or their interaction terms turned out to be significant, and hence we focus on the three fixed factors.

We begin with the analysis of the 2 x 2 between-subjects design involving the factors of direction of anchor (High/Low) and its timing (Before/After viewing). As is apparent from Table 2 (rows marked as Design 1), and applying a general linear model, there is no effect of Timing (F(1,113) = 0.021, p = .885). The interaction of Timing and High/Low anchor was also not significant (F(1,113) = 0.228, p = .634). There is a significant observed anchoring effect of the provided artificial recommendation (F(1,113) = 14.30, p = .0003). The difference between the High and Low conditions was in the expected direction, showing a substantial effect between groups (one-tailed t(58) = 2.788, p = .0035, assuming equal variances). Using Cohen's (1988) d, an effect size measure that indicates the standardized difference between two means (computed by dividing the difference between the two means by a standard deviation for the data), the effect size is 0.71, in the medium-to-large range.
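For reference, a standard formulation of this effect size measure is given below; the text above only says "a standard deviation for the data," so the use of the pooled standard deviation here is our illustrative assumption rather than the paper's stated choice:

    % Cohen's d: standardized difference between two group means,
    % assuming the pooled standard deviation is used
    d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}},
    \qquad
    s_{pooled} = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}}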
Table 2. Mean (SD) Ratings of the Viewed TV Show by Experimental Condition in Study 1 (asterisks mark the rows belonging to Design 1 and Design 2, respectively).
Group (timing-anchor-reliability)   Design 1   Design 2   N    Mean (SD)
Before-High-Strong                  *          *          31   3.48 (1.04)
After-High-Strong                   *                     28   3.43 (0.81)
Control                                                   29   3.22 (0.98)
Before-High-Weak                               *          31   3.08 (1.07)
Before-Low-Weak                                *          29   2.83 (0.75)
After-Low-Strong                    *                     29   2.88 (0.79)
Before-Low-Strong                   *          *          29   2.78 (0.92)

Using only the data within the Before conditions, we continue by analyzing the second 2 x 2 between-subjects design in the study (Table 2, rows marked as Design 2), involving the factors of direction of anchor (High/Low) and perceived system reliability (Strong/Weak). The anticipated effect of weakening the recommender system is opposite for the two recommendation directions: a High-Weak recommendation is expected to be pulled less in the positive direction compared to a High-Strong recommendation, and a Low-Weak recommendation is expected to be pulled less in the negative direction compared to a Low-Strong recommendation. We therefore explore these conjectures by turning to direct tests of the contrasts of interest. There is no significant difference between the High and Low conditions with Weak recommendations (t(58) = 1.053, p = .15), unlike with Strong recommendations (as noted above, p = .0035). Also, the overall effect was reduced for the Weak setting compared to the Strong recommendation setting and was measured as a Cohen's d = 0.16, less than even the small effect size range. Thus, the subjects were sensitive to the perceived reliability of the recommender system: weak recommendations did not operate as a significant anchor when the perceived reliability of the system was lowered.

Finally, we check for asymmetry of the anchoring effect using the control group in comparison to the Before-High and Before-Low groups. (Similar results were obtained using the After-High and After-Low conditions as the comparison, or using the combined High and Low groups.) In other words, we already showed that the High and Low groups were significantly different from each other, but we also want to determine whether each group differs from the Control group (i.e., when no recommendation was provided to the users) in the same manner.
When an artificial High recommendation was provided (4.5), ratings were greater than those of the Control group, but not significantly so (t(58) = 0.997, p = .162). But when an artificial Low recommendation was provided (1.5), ratings were significantly lower than those of the Control group (t(56) = 1.796, p = .039). There was thus an asymmetry of the effect; however, its direction was opposite to that found by Thorsteinson et al. [29]. Study 2 was designed to provide further evidence on this effect, so we return to its discussion later in the paper.

In summary, the analyses indicate a moderate-to-strong effect, supporting the Anchoring Hypothesis. When the recommender system was presented as less reliable, being described as in a test phase and providing only tentative recommendations, the effect size was reduced to a minimal or no effect, in support of the Perceived System Reliability Hypothesis. Finally, the Timing Hypothesis was not supported: the magnitude of the anchoring effect did not differ whether the system recommendation was received before or after the viewing experience. This suggests that the effect is not attributable to a priming of one's attitude prior to viewing. Instead, anchoring is likely to be operating at the time the subject is formulating a response.

Overall, viewers without a system recommendation liked the episode (mean = 3.22, where 3 = "Like it"), as is generally found with product ratings. However, an asymmetry of the anchoring effect was observed at the low end: providing an artificially low recommendation reduced this preference more than providing a high recommendation increased it. This effect is explored further in Study 2.

4. STUDY 2: IMPACT OF ACTUAL RECOMMENDATIONS
Study 2 follows up Study 1 by replacing the artificially fixed anchors with actual personalized recommendations provided by a well-known and commonly used recommendation algorithm. Using the user preferences for TV shows collected in Study 1, a recommender system was designed to estimate the preferences of subjects in Study 2 for unrated shows. Because participants provided their input ratings before being shown any recommendations or other potential anchors, these ratings were unbiased inputs for our recommendation system. Using a design parallel to Study 1, we examine the Anchoring Hypothesis with a recommender system comparable to the ones employed in practice online.

4.1. Methods
197 people completed the study. They were solicited from the same paid subject pool as used for Study 1, with no overlap between the subjects in the two studies. Participants received a fixed fee upon completion of the study.

In Study 2, the anchors received by subjects were based on the recommendations of a true recommender system (discussed below). Each subject watched a show that he or she had indicated not having seen before and that was recommended by an actual real-time system based on the subject's individual ratings. Since there was no significant difference observed between subjects receiving system recommendations before or after viewing a show in Study 1, all subjects in the treatment groups for Study 2 saw the system-provided rating before viewing.

Three levels were used for the recommender system's rating provided to subjects in Study 2: Low (i.e., adjusted to be 1.5 points below the system's predicted rating), Accurate (the system's actual predicted rating), and High (1.5 points above the system's predicted rating). The High and Low conditions were included to learn more about the asymmetry effect observed in Study 1. In addition to the three treatment groups, a control group was included for which no system recommendation was provided. The numbers of participants in the four conditions of the study are shown in Table 4 (Section 4.2).

Based on the TV show rating data collected in Study 1, an online system was built for making TV show recommendations in real time. We compared seven popular recommendation techniques to find the best-performing technique for our dataset. The techniques included simple user- and item-based rating average methods, user- and item-based collaborative filtering approaches and their extensions [2,4,24], as well as a model-based matrix factorization algorithm [11,16] popularized by the recent Netflix Prize competition [3]. Each technique was evaluated using 10-fold cross-validation based on the standard mean absolute error (MAE) and coverage metrics. Although the performances were comparable, item-based collaborative filtering performed slightly better than the other techniques (measured in predictive accuracy and coverage). Also, because the similarities between items could be pre-computed, the item-based technique ran much faster than the other techniques. Therefore, the standard item-based collaborative filtering approach was selected for our recommender system.
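As a minimal sketch of the kind of item-based collaborative filtering prediction and MAE evaluation described above (this is not the system built for the study; the data, function names predict, item_vectors, cosine, and mae, and the cosine similarity choice are our own illustrative assumptions):

    # Illustrative sketch, not the authors' implementation: item-based CF
    # with cosine similarity over co-rating users, plus the MAE metric.
    import math
    from collections import defaultdict

    # ratings[user][show] = rating on the 1-5 scale (half-stars allowed)
    ratings = {
        "u1": {"ShowA": 4.0, "ShowB": 2.5, "ShowC": 5.0},
        "u2": {"ShowA": 3.5, "ShowB": 3.0},
        "u3": {"ShowB": 2.0, "ShowC": 4.5},
    }

    def item_vectors(ratings):
        """Invert the user->item map into item->user rating vectors."""
        vecs = defaultdict(dict)
        for user, items in ratings.items():
            for item, r in items.items():
                vecs[item][user] = r
        return vecs

    def cosine(v1, v2):
        """Cosine similarity computed over users who rated both items."""
        common = set(v1) & set(v2)
        if not common:
            return 0.0
        num = sum(v1[u] * v2[u] for u in common)
        den = math.sqrt(sum(v1[u] ** 2 for u in common)) * \
              math.sqrt(sum(v2[u] ** 2 for u in common))
        return num / den if den else 0.0

    def predict(user, target, ratings, vecs, k=20):
        """Similarity-weighted average of the user's ratings on the k most
        similar items; item similarities can be pre-computed for speed."""
        sims = sorted(((cosine(vecs[target], vecs[item]), r)
                       for item, r in ratings[user].items() if item != target),
                      reverse=True)[:k]
        num = sum(s * r for s, r in sims if s > 0)
        den = sum(s for s, _ in sims if s > 0)
        return num / den if den else None

    def mae(pairs):
        """Mean absolute error over (predicted, actual) rating pairs."""
        return sum(abs(p - a) for p, a in pairs) / len(pairs)

    vecs = item_vectors(ratings)
    print(predict("u3", "ShowA", ratings, vecs))   # predicted rating for an unseen show
    print(mae([(3.3, 3.5), (2.8, 2.5)]))           # toy held-out evaluation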
During the experiments, the system took as input the subject's ratings of shows that had been seen before or for which the participant had indicated familiarity. In real time, the system predicted ratings for all unseen shows and recommended one of the unseen shows for viewing. To avoid possible show effects (e.g., to avoid selecting shows that receive universally bad or good predictions), as well as to ensure that the manipulated ratings (1.5 points above/below the predicted rating) could still fit into the 5-point rating scale, only shows with predicted rating scores between 2.5 and 3.5 were recommended. When making recommendations, the system examined each genre in alphabetical order (i.e., comedy first, followed by drama, mystery, reality, and sci-fi) and went through all unseen shows within each genre alphabetically until one show with a predicted rating between 2.5 and 3.5 was found. This show was then recommended to the subject. When no show was eligible for recommendation, the subject was automatically re-assigned to one of the treatment groups in Study 1.
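The selection rule just described can be summarized by the following sketch (our own rendering under the stated assumptions; the names GENRES, select_show, unseen_by_genre, and predicted are hypothetical):

    # Illustrative sketch of the show-selection rule: scan genres
    # alphabetically and, within each genre, scan unseen shows
    # alphabetically until one has a predicted rating in [2.5, 3.5].
    GENRES = ["Comedy", "Drama", "Mystery/Suspense", "Reality", "Sci Fi/Fantasy"]

    def select_show(unseen_by_genre, predicted, low=2.5, high=3.5):
        """Return the first eligible unseen show, or None if no show
        qualifies (such subjects were re-assigned to Study 1 conditions)."""
        for genre in GENRES:
            for show in sorted(unseen_by_genre.get(genre, [])):
                if low <= predicted[show] <= high:
                    return show
        return None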
The episode watched was provided to subjects in Study 2: Low (i.e., adjusted to be 1.5 either approximately 22 or 45 minutes in duration. For all points below the system’s predicted rating), Accurate (the subjects, the viewing timer was set at 20 minutes, as in Study 1. system’s actual predicted rating), and High (1.5 points above the Subjects were instructed that they would not be able to proceed 39 until the timer reached zero; at which time they could choose to the overall analysis of Study 2 at the High and Low ends. stop and proceed to the next part of the study or to watch the remainder of the episode before proceeding. To pursue the results further, we recognize that one source of variation in Study 2 as compared to Study 1 is that different shows Table 3. Distribution of Shows. were observed by the subjects. As it turns out, 102 of the 198 subjects in Study 2 (52%) ended up watching the same Comedy Genre Number of Shows Available for Viewing show. As a result, we are able to perform post-hoc analyses, Comedy 22 7 Drama 26 8 paralleling the main analyses, limited to this subset of viewers. Mystery/Suspense 25 4 The mean (standard deviation) values across the four conditions Reality 15 4 of these subjects for the main response variable are shown in Sci Fi and Fantasy 17 8 Table 5. Using the same response variable of rating drift, the Total 105 31 overall effect across the experimental conditions was marginally maintained (F(2, 77) = 2.70, p = .07. Providing an accurate recommendation still did not significantly affect preferences for 4.2. Results the show, as compared to the Control condition (two-tailed t(47) = Since the subjects did not all see the same show, the preference 0.671, p = .506). Consistent with Study 1 and the overall ratings for the viewed show were adjusted for the predicted analyses, the High recommendation condition led to inflated ratings of the system, in order to obtain a response variable on a ratings compared to the Low condition (one-tailed t(51) = 2.213, p comparable scale across subjects. Thus, the main response = .016). The effect size was also comparable to the overall effect variable is the rating drift, which we define as: magnitude with Cohen’s d = 0.61, a medium effect size. Rating Drift = Actual Rating – Predicted Rating. However, for the limited sample of subjects who watched the same episode, the effects at the High and Low end were not Predicted Rating represents the rating of the TV show watched by symmetric. Compared to receiving an Accurate recommendation, the user during the study as predicted by the recommendation there was a significant effect of the recommendation being raised algorithm (before any perturbations to the rating are applied), and (t(52) = 1.847, p = .035, Cohen’s d = .50), but not of being Actual Rating is the user’s rating value for this TV show after lowered (t(51) = 0.286, p = .388). watching the episode. Therefore, positive/negative Rating Drift values represent situations where the user’s submitted rating was Table 5. Mean(SD) Rating Drift for Subjects Who Watched higher/lower than the system’s rating, as possibly affected by the Same Comedy Show in Study 2. positive/ negative perturbations (i.e., high/low anchors). Group N Mean (SD) Similarly to Study 1, our preliminary analyses using general linear High 27 0.81 (0.82) models indicated that none of the variables collected in the survey Control 22 0.53 (0.76) (such as demographics, etc.) 
Similarly to Study 1, our preliminary analyses using general linear models indicated that none of the variables collected in the survey (such as demographics) were significant in explaining the response variable. The mean (standard deviation) values of rating drift across the four conditions of the study are shown in Table 4. Using a one-way ANOVA, the three experimental groups (i.e., High, Low, and Accurate) differed significantly overall (F(2, 147) = 3.43, p = .035).

Table 4. Mean (SD) Rating Drift of the Viewed TV Show by Experimental Condition, Study 2.
Group      N    Mean (SD)
High       51   0.40 (1.00)
Control    48   0.14 (0.94)
Accurate   51   0.13 (0.96)
Low        47   -0.12 (0.94)

Providing an accurate recommendation did not significantly affect preferences for the show, as compared to the Control condition (two-tailed t(97) = 0.023, p = .982). Consistent with Study 1, the High recommendation condition led to inflated ratings compared to the Low condition (one-tailed t(96) = 2.629, p = .005). The effect size was of slightly smaller magnitude than in Study 1, with Cohen's d = 0.53, a medium effect size. However, unlike in Study 1, the anchoring effect in Study 2 is symmetric at the High and Low ends. There was a marginally significant effect of the recommendation being lowered compared to being accurate (t(96) = 1.305, p = .098, Cohen's d = .30), and a marginally significant effect at the High end compared to receiving Accurate recommendations (t(100) = 1.366, p = .088, Cohen's d = .23). Similar effects are observed when comparing the High/Low conditions to the Control condition. In summary, the Anchoring Hypothesis is supported in Study 2, consistently with Study 1. However, the anchoring effects were symmetric in the overall analysis of Study 2 at the High and Low ends.

To pursue the results further, we recognize that one source of variation in Study 2 as compared to Study 1 is that different shows were observed by the subjects. As it turns out, 102 of the 197 subjects in Study 2 (52%) ended up watching the same comedy show. As a result, we are able to perform post-hoc analyses, paralleling the main analyses, limited to this subset of viewers. The mean (standard deviation) values of the main response variable across the four conditions for these subjects are shown in Table 5. Using the same response variable of rating drift, the overall effect across the experimental conditions was marginally maintained (F(2, 77) = 2.70, p = .07). Providing an accurate recommendation still did not significantly affect preferences for the show, as compared to the Control condition (two-tailed t(47) = 0.671, p = .506). Consistent with Study 1 and the overall analyses, the High recommendation condition led to inflated ratings compared to the Low condition (one-tailed t(51) = 2.213, p = .016). The effect size was also comparable to the overall effect magnitude, with Cohen's d = 0.61, a medium effect size.

However, for the limited sample of subjects who watched the same episode, the effects at the High and Low ends were not symmetric. Compared to receiving an Accurate recommendation, there was a significant effect of the recommendation being raised (t(52) = 1.847, p = .035, Cohen's d = .50), but not of it being lowered (t(51) = 0.286, p = .388).

Table 5. Mean (SD) Rating Drift for Subjects Who Watched the Same Comedy Show in Study 2.
Group      N    Mean (SD)
High       27   0.81 (0.82)
Control    22   0.53 (0.76)
Accurate   27   0.37 (0.93)
Low        26   0.30 (0.86)

Thus, the indicated asymmetry of the anchoring effect is different from the asymmetry present in Study 1, being at the High end rather than the Low end. Also, the asymmetry is not robust across the overall data. This indicates that the underlying cause of such asymmetries is situational, in this case depending upon specific TV show effects. When looking at effects across different TV shows (Table 4), the show effects average out and symmetry is observed overall. When looking at effects for a particular show (Tables 2 and 5), idiosyncratic asymmetries can arise.

5. STUDY 3: ACTUAL RECOMMENDATIONS WITH JOKES
Study 3 provides a generalization of Study 2 within a different content domain, applying a recommender system to joke preferences rather than TV show preferences. As in Study 2, the procedure uses actual recommendations provided by a commonly used recommendation algorithm. A within-subjects design also allows us to investigate behavior at an individual level of analysis, rather than in the aggregate. We apply a wider variety of perturbations to the actual recommendations for each subject, ranging from -1.5 to +1.5 (the endpoint values used in Study 2), rather than using a single perturbation per subject.

5.1. Methods
61 people received a fixed fee for completing the study. They were solicited from the same paid subject pool used for Studies 1 and 2, with no overlap across the three studies.
As in Study 2, the anchors received by subjects were based on the recommendations of a true recommender system. The item-based collaborative filtering technique was used to maintain consistency with Study 2. The same list of 100 jokes was used during the study, though the order of the jokes was randomized between subjects. The jokes and the rating data for training the recommendation algorithm were taken from the Jester Online Joke Recommender System repository [12]. Specifically, we used their Dataset 2, which contains 150 jokes. To get to our list of 100, we removed the jokes that were suggested for removal at the Jester website (because they were either included in the "gauge set" in the original Jester joke recommender system or because they were never displayed or rated), the jokes that more than one of the coauthors of our study identified as having overly objectionable content, and finally the jokes that were greatest in length (based on word count).

The procedure paralleled that used for Study 2, with changes adapted to the new context. Subjects first evaluated 50 jokes, randomly selected and ordered from the list of 100, as a basis for providing recommendations. The same 5-star rating scale with half-star ratings from Studies 1 and 2 was used, affording a 9-point scale for responses. Next, the subjects received 40 jokes with a predicted rating displayed. Thirty of these predicted ratings were perturbed, 5 each using perturbations of -1.5, -1.0, -0.5, +0.5, +1.0, and +1.5. The 30 jokes that were perturbed were determined pseudo-randomly to ensure that the manipulated ratings would fit into the 5-point rating scale. First, 10 jokes with predicted rating scores between 2.5 and 3.5 were selected randomly to receive perturbations of -1.5 and +1.5. From the remaining jokes, 10 jokes with predicted rating scores between 2.0 and 4.0 were selected randomly to receive perturbations of -1.0 and +1.0. Then, 10 jokes with predicted rating scores between 1.5 and 4.5 were selected randomly to receive perturbations of -0.5 and +0.5. Ten predicted ratings were not perturbed and were displayed exactly as predicted. These 40 jokes were randomly intermixed. Following the first experimental session (3 sessions were used in total), a final 10 jokes were added as a control: a display was added on which subjects provided preference ratings for these 10 jokes with no predicted rating provided, again in random order. Finally, in all sessions, subjects completed a short demographic survey.
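The constraint-respecting assignment of perturbations described above can be sketched as follows (our own illustrative rendering, not the study's code; the names assign_perturbations and predicted are hypothetical, and the sketch assumes at least 10 eligible jokes per perturbation magnitude):

    # Illustrative sketch of the pseudo-random perturbation assignment:
    # each perturbation size is assigned only to jokes whose predicted
    # rating keeps the displayed value within the 1-5 scale.
    import random

    def assign_perturbations(predicted, rng=random):
        """predicted: dict joke_id -> predicted rating (1.0-5.0).
        Returns dict joke_id -> perturbation for 30 perturbed jokes."""
        assignment = {}
        remaining = set(predicted)
        # (magnitude, admissible prediction range): +/-1.5 needs [2.5, 3.5], etc.
        for magnitude, lo, hi in [(1.5, 2.5, 3.5), (1.0, 2.0, 4.0), (0.5, 1.5, 4.5)]:
            eligible = [j for j in remaining if lo <= predicted[j] <= hi]
            chosen = rng.sample(eligible, 10)          # 10 jokes per magnitude
            for i, joke in enumerate(chosen):
                # half get the negative perturbation, half the positive one
                assignment[joke] = -magnitude if i < 5 else magnitude
            remaining -= set(chosen)
        return assignment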
5.2. Results
As in Study 2, the main response variable for Study 3 was Rating Drift (i.e., Actual Rating - Predicted Rating). As an illustration of the overall picture, Figure 2 shows the mean Rating Drift, aggregated across items and subjects, for each perturbation used in the study. In the aggregate, there is a linear relationship both for negative and for positive perturbations. For comparison purposes, Table 6 shows the mean (standard deviation) values across the four perturbation conditions of Study 3 that were comparable to those used in Study 2 (aggregating across all relevant Study 3 responses). The general pattern for Study 3, using jokes and a within-subjects design, parallels that of Study 2, using TV shows and a between-subjects design.

The within-subjects design also allows for analyses of the Anchoring Hypothesis at the individual level. We began by testing the slopes across subjects between negative and positive perturbations, and no significant difference was observed (t(60) = 1.39, two-tailed p = .17). We also checked for curvilinearity for each individual subject for both positive and negative perturbations. No significant departures from linearity were observed, so all reported analyses use only first-order effects. As an indicator of the magnitude of the effect, we examined the distribution of the correlation coefficients for the individual analyses. The mean magnitude of the relationship is 0.37, with values ranging from -0.27 to 0.87.

Overall, the analyses strongly suggest that the effect of perturbations on rating drift is not discrete. Perturbations have a continuous effect upon ratings with, on average, a drift of 0.35 rating points occurring for every rating point of perturbation (e.g., the mean rating drift is 0.53 for a perturbation of +1.5).
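In equation form, the aggregate pattern reduces approximately to the following linear relationship (our summary of the slope reported above, not a fitted model from the paper):

    % aggregate relationship between perturbation size and mean drift
    \text{Mean Rating Drift} \approx 0.35 \times \text{Perturbation},
    \quad \text{e.g., } 0.35 \times (+1.5) \approx 0.53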
[Figure 2. Mean Rating Drift as a Function of the Amount of Rating Perturbation and for the Control Condition in Study 3. Approximate values recoverable from the figure: mean rating drift of -0.53, -0.41, -0.23, -0.20, 0.07, 0.28, and 0.53 for perturbations of -1.5, -1.0, -0.5, 0, +0.5, +1.0, and +1.5, respectively, and -0.04 for the Control condition.]

Table 6. Mean (SD) Rating Drift for Study 3, in the Conditions Comparable to Those Used in Study 2 (±1.5, 0, Control).
Group      N     Mean (SD)
High       305   0.53 (0.94)
Control    320   -0.04 (1.07)
Accurate   610   -0.20 (0.97)
Low        305   -0.53 (0.95)

6. DISCUSSION AND CONCLUSIONS
We conducted three laboratory experiments and systematically examined the impact of recommendations on consumer preferences. The research integrates ideas from behavioral decision theory and recommender systems, from both practical and theoretical standpoints. The results provide strong evidence that biased output from recommender systems can significantly influence the preference ratings of consumers.

From a practical perspective, the findings have several important implications. First, they suggest that standard performance metrics for recommender systems may need to be rethought to account for these phenomena. If recommendations can influence consumer-reported ratings, then how should recommender systems be objectively evaluated? Second, how does this influence impact the inputs to recommender systems? If two consumers provide the same rating, but based on different initial recommendations, do their preferences really match for the purpose of identifying future recommendations? Consideration of issues like these arises as a needed area of study. Third, our findings bring to light the potential impact of recommender systems on strategic practices. If consumer choices are significantly influenced by recommendations, regardless of accuracy, then the potential arises for unscrupulous business practices. For example, it is well known that Netflix uses its recommender system as a means of inventory management, filtering recommendations based on the availability of items [26]. Taking this one step further, online retailers could potentially use preference bias based on recommendations to increase sales.

Further research is clearly needed to understand the effects of recommender systems on consumer preferences and behavior. Issues of trust, decision bias, and preference realization appear to be intricately linked in the context of recommendations in online marketplaces. Additionally, the situation-dependent asymmetry of these effects must be explored to understand which situational characteristics have the largest influence. Moreover, future research is needed to investigate the error-compounding issue of anchoring: how far can people be pulled in their preferences if a recommender system keeps providing biased recommendations?

Finally, this study has brought to light a potentially significant issue in the design and implementation of recommender systems. Since recommender systems rely on preference inputs from users, bias in these inputs may have a cascading error effect on the performance of recommender system algorithms. Further research on the full impact of these biases is clearly warranted.

7. ACKNOWLEDGMENT
This work is supported in part by the National Science Foundation grant IIS-0546443.

REFERENCES
[1] Adomavicius, G., and Tuzhilin, A. 2005. "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions," IEEE Transactions on Knowledge and Data Engineering, 17, (6), 734-749.
[2] Bell, R.M., and Koren, Y. 2007. "Improved Neighborhood-Based Collaborative Filtering," KDD Cup and Workshop 2007, San Jose, California, USA, 7-14.
[3] Bennett, J., and Lanning, S. 2007. "The Netflix Prize," KDD Cup and Workshop, San Jose, CA, www.netflixprize.com.
[4] Breese, J.S., Heckerman, D., and Kadie, C. 1998. "Empirical Analysis of Predictive Algorithms for Collaborative Filtering," 14th Conference on Uncertainty in Artificial Intelligence, Madison, WI.
[5] Chapman, G., and Bornstein, B. 1996. "The More You Ask for, the More You Get: Anchoring in Personal Injury Verdicts," Applied Cognitive Psychology, 10, 519-540.
[6] Chapman, G., and Johnson, E. 2002. "Incorporating the Irrelevant: Anchors in Judgments of Belief and Value," in Heuristics and Biases: The Psychology of Intuitive Judgment, T. Gilovich, D. Griffin, and D. Kahneman (eds.). Cambridge: Cambridge University Press, 120-138.
[7] Cosley, D., Lam, S., Albert, I., Konstan, J.A., and Riedl, J. 2003. "Is Seeing Believing? How Recommender Interfaces Affect Users' Opinions," CHI 2003 Conference, Fort Lauderdale, FL.
[8] Deshpande, M., and Karypis, G. 2004. "Item-Based Top-N Recommendation Algorithms," ACM Transactions on Information Systems, 22, (1), 143-177.
[9] Epley, N., and Gilovich, T. 2010. "Anchoring Unbound," Journal of Consumer Psychology, 20, 20-24.
[10] Flynn, L.J. January 23, 2006. "Like This? You'll Hate That. (Not All Web Recommendations Are Welcome.)," New York Times, from http://www.nytimes.com/2006/01/23/technology/23recommend.html.
[11] Funk, S. 2006. "Netflix Update: Try This at Home," from http://sifter.org/~simon/journal/20061211.html.
[12] Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. 2001. "Eigentaste: A Constant Time Collaborative Filtering Algorithm," Information Retrieval, 4, (2), 133-151.
[13] Jacowitz, K.E., and Kahneman, D. 1995. "Measures of Anchoring in Estimation Tasks," Personality and Social Psychology Bulletin, 21, 1161-1166.
[14] Johnson, J.E.V., Schnytzer, A., and Liu, S. 2009. "To What Extent Do Investors in a Financial Market Anchor Their Judgments Excessively? Evidence from the Hong Kong Horserace Betting Market," Journal of Behavioral Decision Making, 22, 410-434.
[15] Komiak, S., and Benbasat, I. 2006. "The Effects of Personalization and Familiarity on Trust and Adoption of Recommendation Agents," MIS Quarterly, 30, (4), 941-960.
[16] Koren, Y., Bell, R., and Volinsky, C. 2009. "Matrix Factorization Techniques for Recommender Systems," IEEE Computer, 42, 30-37.
[17] Ku, G., Galinsky, A.D., and Murnighan, J.K. 2006. "Starting Low but Ending High: A Reversal of the Anchoring Effect in Auctions," Journal of Personality and Social Psychology, 90, 975-986.
[18] Lichtenstein, S., and Slovic, P. (eds.). 2006. The Construction of Preference. Cambridge: Cambridge University Press.
[19] McNee, S.M., Lam, S.K., Konstan, J.A., and Riedl, J. 2003. "Interfaces for Eliciting New User Preferences in Recommender Systems," in User Modeling 2003, Proceedings. Berlin: Springer-Verlag, 178-187.
[20] Mussweiler, T., and Strack, F. 2000. "Numeric Judgments under Uncertainty: The Role of Knowledge in Anchoring," Journal of Experimental Social Psychology, 36, 495-518.
[21] Northcraft, G., and Neale, M. 1987. "Experts, Amateurs, and Real Estate: An Anchoring-and-Adjustment Perspective on Property Pricing Decisions," Organizational Behavior and Human Decision Processes, 39, 84-97.
[22] Pu, P., and Chen, L. 2007. "Trust-Inspiring Explanation Interfaces for Recommender Systems," Knowledge-Based Systems, 20, (6), 542-556.
[23] Russo, J.E. 2010. "Understanding the Effect of a Numerical Anchor," Journal of Consumer Psychology, 20, 25-27.
[24] Sarwar, B., Karypis, G., Konstan, J.A., and Riedl, J. 2001. "Item-Based Collaborative Filtering Recommendation Algorithms," 10th International World Wide Web Conference, Hong Kong, 285-295.
[25] Schonfeld, E. July 2007. "Click Here for the Upsell," CNNMoney.com, from http://money.cnn.com/magazines/business2/business2_archive/2007/07/01/100117056/index.htm.
[26] Shih, W., Kaufman, S., and Spinola, D. 2007. "Netflix," Harvard Business School Publishing, case number 9-607-138.
[27] Strack, F., and Mussweiler, T. 1997. "Explaining the Enigmatic Anchoring Effect: Mechanisms of Selective Accessibility," Journal of Personality and Social Psychology, 73, 437-446.
[28] Swearingen, K., and Sinha, R. 2001. "Beyond Algorithms: An HCI Perspective on Recommender Systems," ACM SIGIR 2001 Workshop on Recommender Systems, New Orleans, Louisiana.
[29] Thorsteinson, T., Breier, J., Atwell, A., Hamilton, C., and Privette, M. 2008. "Anchoring Effects on Performance Judgments," Organizational Behavior and Human Decision Processes, 107, 29-40.
[30] Tversky, A., and Kahneman, D. 1974. "Judgment under Uncertainty: Heuristics and Biases," Science, 185, 1124-1131.