<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>De-Biasing User Preference Ratings in Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gediminas Adomavicius</string-name>
          <email>gedas@umn.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesse Bockstedt</string-name>
          <email>bockstedt@email.arizona.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shawn Curley</string-name>
          <email>curley@umn.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jingjing Zhang</string-name>
          <email>jjzhang@indiana.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indiana University</institution>
          ,
          <addr-line>Bloomington, IN</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Arizona</institution>
          ,
          <addr-line>Tucson, AZ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Minnesota</institution>
          ,
          <addr-line>Minneapolis, MN</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Prior research has shown that online recommendations have significant influence on users' preference ratings and economic behavior. Specifically, the self-reported preference rating (for a specific consumed item) that is submitted by a user to a recommender system can be affected (i.e., distorted) by the previously observed system's recommendation. As a result, anchoring (or anchoring-like) biases reflected in user ratings not only provide a distorted view of user preferences but also contaminate inputs of recommender systems, leading to decreased quality of future recommendations. This research explores two approaches to removing anchoring biases from self-reported consumer ratings. The first proposed approach is based on a computational post-hoc de-biasing algorithm that systematically adjusts the user-submitted ratings that are known to be biased. The second approach is a user-interface-driven solution that tries to minimize anchoring biases at rating collection time. Our empirical investigation explicitly demonstrates the impact of biased vs. unbiased ratings on recommender systems' predictive performance. It also indicates that the post-hoc algorithmic debiasing approach is very problematic, most likely due to the fact that the anchoring effects can manifest themselves very differently for different users and items. This further emphasizes the importance of proactively avoiding anchoring biases at the time of rating collection. Further, through laboratory experiments, we demonstrate that certain interface designs of recommender systems are more advantageous than others in effectively reducing anchoring biases.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender systems</kwd>
        <kwd>anchoring effects</kwd>
        <kwd>rating de-biasing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Recommender systems are prevalent decision aids in the
electronic marketplace, and online recommendations significantly
impact the decision-making process of many consumers. Recent
studies show that online recommendations can manipulate not
only consumers’ preference ratings but also their willingness to
pay for products [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ]. For example, using multiple experiments
with TV shows, jokes and songs, prior studies found evidence that
a recommendation provided by an online system serves as an
anchor when consumers form their preference for products, even
at the time of consumption [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Furthermore, using the
system-predicted ratings as a starting point and biasing them (by
perturbing them up or down) to varying degrees, this anchoring
effect was observed to be continuous, with the magnitude
proportional to the size of the perturbation of the recommendation
in both positive and negative directions – about 0.35-star effect
for each 1-star perturbation on average across all users and items
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Additionally, research found that recommendations displayed
to participants significantly pulled their willingness to pay for
items in the direction of the recommendation, even when
controlling for participants’ preferences and demographics [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Based on these previous studies, we know that users’
preference ratings can be significantly distorted by the
system-predicted ratings that are displayed to users. Such distorted
preference ratings are subsequently submitted as users’ feedback
to recommender systems, which can potentially lead to a biased
view of consumer preferences and several potential problems
[
        <xref ref-type="bibr" rid="ref1 ref5">1,5</xref>
        ]: (i) biases can contaminate the recommender system’s
inputs, weakening the system’s ability to provide high-quality
recommendations in subsequent iterations; (ii) biases can
artificially pull consumers’ preferences towards displayed system
recommendations, providing a distorted view of the system’s
performance; (iii) biases can lead to a distorted view of items
from the users’ perspectives. Thus, when using recommender
systems, anchoring biases can be harmful to the system’s use and
value, and the removal of anchoring biases from consumer ratings
constitutes an important and highly practical research problem.
      </p>
      <p>In this research, we focus on the problem of “de-biasing”
self-reported consumer preference ratings for consumed items. We
first empirically demonstrate that the use of unbiased preference
ratings as inputs indeed leads to higher predictive accuracy of
recommendation algorithms than the use of biased preference
ratings. We then propose and investigate two possible approaches
to tackle the rating de-biasing problem:
1) Post-hoc rating adjustment (reactive approach): a
computational approach that attempts to adjust the
user-submitted ratings by taking into account the system
recommendation observed by the user.
2) Bias-aware interface design for rating collection (proactive
approach): a design-based approach that employs a user
interface for rating collection by presenting recommendations
in a way that eliminates or reduces anchoring effects.</p>
    </sec>
    <sec id="sec-2">
      <title>2. BACKGROUND</title>
      <p>
        Prior literature has investigated how the cues provided by
recommender systems influence online consumer behavior. For
example, Cosley et al. (2003) found that users showed high
test-retest consistency when being asked to re-rate a movie with no
prediction provided [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, when users were asked to
re-rate a movie while being shown a “predicted” rating that was
altered upward or downward from their original rating by a single
fixed amount of one rating point (i.e., providing a high or low
anchor), users tended to give higher or lower ratings, respectively,
as compared to a control group receiving accurate original ratings.
This showed that anchoring could affect users’ ratings based on
preference recall, for movies seen in the past and now being
evaluated.
      </p>
      <p>
        Adomavicius et al. (2013) looked at a similar effect in an even
more controlled setting, in which the consumer preference ratings
for items were elicited at the time of item consumption [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Even
without a delay between consumption and elicited preference,
anchoring effects were observed. The displayed predicted ratings,
when perturbed to be higher or lower, affected the submitted
consumer ratings to move in the same direction.
      </p>
      <p>
        Prior research also found that recommendations not only
significantly affect consumers’ preference ratings but also their
economic behavior [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The researchers presented the results of two
controlled experiments in the context of purchasing digital songs.
The studies found strong evidence that randomly assigned song
recommendations affected participants’ willingness to pay, even
when controlling for participants’ preferences and demographics.
Similar effects on willingness to pay were also observed when
participants viewed actual system-generated recommendations
that were intentionally perturbed up or down (introducing
recommendation error).
      </p>
      <p>The anchoring biases occurring due to system-generated
recommendations can potentially lead to several issues. From the
consumers’ perspective, anchoring biases can distort (or
manipulate) consumers’ preferences and economic behavior, and
therefore lead to suboptimal product choices and distorted
preference ratings. From the retailer’s perspective (e.g., Amazon,
eBay), anchoring biases may allow third-party agents to
manipulate the recommender system (e.g., by strategically adding
malicious ratings) so that it operates in their favor. This would
reduce consumers’ trust in the recommender system and harm the
success of the system in the long term. From the system
designers’ perspective, the distorted user preference ratings that
are subsequently submitted as consumers’ feedback to
recommender systems can contaminate the inputs of the
recommender system, reducing its effectiveness. Therefore,
removing the bias of recommendations represents an important
research question. In the following sections, we empirically study
two possible approaches for tackling the rating de-biasing
problem.</p>
    </sec>
    <sec id="sec-3">
      <title>3. APPROACH I: POST-HOC RATING ADJUSTMENT</title>
    </sec>
    <sec id="sec-5">
      <title>3.1 Rating Adjustment Algorithm</title>
      <p>
        The underlying intuition of post-hoc rating adjustment is to
“reverse-engineer” consumers’ true non-biased ratings from the
user-submitted ratings and the displayed system recommendations
(that were observed by the users). For this, we use the
information established by previous research that, in aggregate,
the anchoring effect of online recommendations is linear and
proportional to the size of the recommendation perturbation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
As depicted in Fig 1, the deviation of the system’s displayed
prediction from the user’s unbiased rating (i.e., Dev) is assumed to
produce a proportional deviation of the submitted rating from the
unbiased rating (i.e., α × Dev). Given the user’s submitted rating,
the displayed system prediction, and the expected anchoring effect
size, we develop a computational rule to systematically
reverse-engineer users’ unbiased ratings.
      </p>
      <p>Fig 1. Post-Hoc Rating Adjustment Illustration</p>
      <p>Mathematically, let α be the expected slope (i.e.,
proportionality coefficient) of the bias relative to the size of rating
perturbation, p<sub>ui</sub> be the value of the system’s predicted rating
on item i that was shown to user u, and r<sub>ui</sub> be the user’s
submitted rating after seeing the system’s prediction. We estimate
the unbiased rating of user u for item i, i.e., r*<sub>ui</sub>, using
the formula below:
r*<sub>ui</sub> = (r<sub>ui</sub> − α × p<sub>ui</sub>) ⁄ (1 − α).</p>
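      <p>For illustration, this adjustment rule can be sketched as a small Python function. This is our own minimal sketch of the linear anchoring model described above; the function and variable names are assumptions, not the paper’s implementation:</p>

```python
def debias_rating(submitted, shown, alpha):
    """Estimate an unbiased rating by inverting the linear anchoring model:
    submitted = unbiased + alpha * (shown - unbiased).

    `alpha` is the anchoring slope and must lie in [0, 1); alpha = 0 means
    no adjustment, and larger alpha means a larger adjustment.
    """
    if not (0.0 <= alpha < 1.0):
        raise ValueError("alpha must be in [0, 1)")
    return (submitted - alpha * shown) / (1.0 - alpha)

# Example: a high anchor of 4.0 stars, submitted rating 3.5, slope 0.35
# yields an unbiased estimate below the submitted rating.
estimate = debias_rating(3.5, 4.0, 0.35)
```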
      <p>In this post-hoc adjustment approach, the value of α is
determined by the observed slope of the bias and can range
between 0 (inclusive) and 1 (exclusive). Varying the size of α
within [0, 1) changes the degree of rating adjustment, i.e., a larger
value of α leads to a larger adjustment to the submitted rating,
while α = 0 means no adjustment is made. In our experiments, the
slope α can be either a global constant that applies to all users and
items, or user-specific values determined by an individual user’s
tendency of anchoring on the system’s recommendations.</p>
    </sec>
    <sec id="sec-6">
      <title>3.2 Computational Experiments</title>
      <sec id="sec-6-1">
        <title>3.2.1 Joke Rating Dataset</title>
        <p>
          Our experiments use a Joke rating dataset collected in
laboratory settings by a prior study on anchoring effects of
recommender systems [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The dataset includes ratings provided
by 61 users on 100 jokes. At the beginning of the study,
participants first evaluated 50 jokes without seeing a system’s
recommendations. These initial ratings reflect users’ unbiased
preferences and were used as a basis for computing the system’s
predictions. Next, the participants received 40 jokes with a
predicted rating displayed. Thirty of these predicted ratings were
perturbed to various degrees and ten were not perturbed. These 40
jokes were randomly intermixed.
        </p>
        <p>Prior research has observed continuous and linear anchoring
effects on this joke rating dataset. On average, the anchoring
slope across all users and items is α = 0.35, and is significantly
positive. Individual linear regression models were also obtained
at an individual-user level. These user-specific regression slopes
are predominately positive, suggesting that significant anchoring
bias was observed for most participants.</p>
        <p>For the post-hoc de-biasing experiments, we partition the joke
ratings for each user into two subsets. The first subset contains
the initial 50 ratings provided by each user before seeing any
system recommendations (i.e., unbiased), and the second subset
contains the subsequent 40 user ratings submitted after the user
received the system’s recommendations with various levels of
perturbations (i.e., biased ratings). Next, on the 40 biased ratings,
we apply the post-hoc rating adjustment rule to remove possible
anchoring biases to recover users’ unbiased ratings.</p>
        <p>To evaluate the benefits of post-hoc rating adjustment, we
compute predictive accuracy (measured as Root Mean Squared
Error, i.e., RMSE) of standard recommendation algorithms using
the adjusted ratings (i.e., de-biased) as training data and the initial
ratings (i.e., unbiased) as testing data. We then compare this
accuracy performance with that of using actual submitted ratings
(i.e., biased) as training data and the same initial ratings as testing
data. If rating de-biasing is successful, the prediction accuracy on
“de-biased” ratings should be better than accuracy on “biased”
ratings. We explore the post-hoc rating adjustment under a variety
of settings, as described below.</p>
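        <p>The evaluation criterion used throughout these experiments can be sketched as follows. This is an illustrative implementation of RMSE over predicted vs. held-out unbiased ratings, not the study’s actual code:</p>

```python
import math

def rmse(predictions, actuals):
    """Root Mean Squared Error between a model's predicted ratings and the
    held-out unbiased (initial) ratings used as test data."""
    if len(predictions) != len(actuals):
        raise ValueError("prediction and test lists must align")
    squared_error = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(squared_error / len(predictions))
```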
      </sec>
      <sec id="sec-6-2">
        <title>3.2.2 Experiments</title>
        <p>Our first experiment investigated the accuracy performance on
unbiased, biased, and de-biased ratings adjusted based on various
rules and statistically compared their differences. First, we
randomly divided the 50 initial (unbiased) ratings provided by
each user into two equal subsets with 25 ratings per user
(aggregated across all users) in each subset. We used one subset
as the training data to build the model and evaluated the model’s
predictive accuracy on the other subset (i.e., the testing set).
Because both training and testing data are comprised of unbiased
ratings submitted by users without seeing any system prediction,
the accuracy performance computed based on these initial ratings
would provide us the upper bound of accuracy performance for
each recommendation algorithm.</p>
        <p>
          We then selected 25 random ratings from the set of 40 biased
submissions for each user and used them as inputs to re-build the
recommendation model. The model’s predictive accuracy was
evaluated on the same exact testing set (i.e., 25 unbiased ratings
from each user). Next we adjusted these 25 biased ratings using
either the suggested global slope of α = 0.35 or user-specific
adjustment slopes. When a global adjustment is used, the ratings
submitted by all users are adjusted using the same global slope α.
In contrast, when a user-specific adjustment is used, we first
estimate the regression slope α<sub>u</sub> for each user u based on the
user’s experimental data. If the estimated slope α<sub>u</sub> is significant
(i.e., p &lt;= 0.05), we use it to adjust the ratings provided by the
given user. Each user hence has a unique adjustment slope.
Finally, we computed the predictive accuracy using these 25
de-biased ratings as training data. The predictive accuracy of rating
samples was computed for several well-known recommendation
algorithms, including a simple global baseline heuristic (i.e.,
Baseline) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], the matrix factorization approach (i.e., SVD) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ],
and user- and item-based collaborative filtering algorithms (i.e.,
CF_User and CF_Item) [
          <xref ref-type="bibr" rid="ref10 ref7">7,10</xref>
          ].
        </p>
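        <p>The per-user slope estimate can be sketched as an ordinary least-squares regression of the observed bias on the anchor deviation. This is a hypothetical sketch with names of our choosing; the significance test (p &lt;= 0.05) that gates whether the slope is actually used is omitted for brevity:</p>

```python
def anchoring_slope(shown, unbiased, submitted):
    """Least-squares slope of observed bias vs. anchor deviation for one user.

    Regresses (submitted - unbiased) on (shown - unbiased) across the user's
    rated items and returns the fitted slope, an estimate of alpha_u.
    """
    x = [s - u for s, u in zip(shown, unbiased)]       # anchor deviations
    y = [r - u for r, u in zip(submitted, unbiased)]   # observed biases
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx
```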
        <p>In our experiment we repeated the above steps 30 times and
extracted different random samples each time. We report the
average accuracy performances based on unbiased, biased, and
de-biased ratings in Table 1. The training data resulting in best
performance for each recommendation method is indicated in
boldface.</p>
        <p>As seen in Table 1, the initial (unbiased) ratings provide the
best accuracy performance for all recommendation algorithms,
clearly demonstrating the advantage of unbiased ratings over
biased ratings on recommender systems’ predictive performance.
Most of the accuracy comparisons in the table are statistically
significant (p &lt; 0.05). The only two exceptions are the contrasts
between de-biased ratings based on global and user-specific
slopes for Baseline and SVD. The results suggest that the use of
unbiased preference ratings as inputs indeed leads to significantly
higher predictive accuracy of recommendation algorithms than the
use of biased preference ratings. In addition, the de-biased ratings
(adjusted based on either global or user-specific slopes) did not
provide accuracy benefits. Adjusted ratings based on
user-specific slopes lead to slightly better accuracy than ratings
adjusted based on the global slope of α = 0.35. However, neither
of the two post-hoc de-biasing adjustments was helpful in
improving accuracy. These patterns are consistent across various
popular recommendation algorithms described in Table 1.</p>
        <p>In the second experiment, we explored different de-biasing
slope values for user ratings and computed predictive accuracy on
the entire rating dataset (as opposed to randomly chosen rating
samples as in the first experiment). Specifically, we took all 40
biased ratings submitted by users after seeing the system’s
predictions and adjusted these ratings using the post-hoc
de-biasing rule. All of these 40 “de-biased” ratings were then used as
training data to compute predictions using standard
recommendation algorithms, and the predictive accuracy was
evaluated on the initial 50 unbiased ratings. We varied the
de-biasing slopes and explored both global and user-specific
adjustments.</p>
        <p>Fig 2 summarizes the predictive accuracy performance on
ratings de-biased based on different adjustment slope parameters.
When the slope value is equal to zero, it means no adjustment was
made, i.e., the user’s actual submitted ratings (biased) were used
as training data for the recommendation algorithms. The vertical
black line on the left side corresponds to the accuracy
performance of various algorithms with these actual-submitted
ratings (i.e., biased) as training data. In addition to exploring
different global adjustment slopes, we also experimented with
user-specific adjustments as indicated by the vertical black line on
the right side.</p>
        <p>Fig 2. Predictive accuracy of de-biased ratings, with varying
adjustment slopes.</p>
        <p>
          Based on our experimental results, using users' actual
submitted ratings (i.e., no adjustment) provided better accuracy
performance than using de-biased ratings adjusted to any degree.
As we increase the size of the global adjustment slope, the
predictive accuracy performance estimated on test ratings
decreases monotonically. Additionally, although the resulting
accuracy of a user-specific adjustment is slightly better than that
of the global slope of α = 0.35 suggested in prior research, the
user-specific adjustment still did not yield better accuracy than no
adjustment or small global adjustments. Overall, our experiment
was unable to achieve any predictive accuracy improvements by
de-biasing consumer ratings with either a global de-biasing rule
based on a single slope parameter or the individual user-level
rules based on user-specific slope parameters. We also conducted
additional experiments with a variety of settings of post-hoc rating
adjustment. For example, we introduced a tolerance threshold and
only adjusted a submitted rating when it differs from the system’s
predicted rating by more than a certain amount (e.g., 0.5 stars).
We also rounded de-biased ratings to various rating scales (e.g., to
half stars, or to the first decimal place). We further experimented
with adjusting only the positively biased ratings or only the
negatively biased ratings to compare accuracy improvements. In
addition, we empirically explored post-hoc rating de-biasing with
a real-world movie rating dataset provided by Netflix [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
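          <p>One of the additional variants mentioned above, the tolerance threshold, can be sketched as follows. This is our own illustration under the stated assumption of a 0.5-star threshold; the function and parameter names are hypothetical:</p>

```python
def debias_with_tolerance(submitted, shown, alpha=0.35, tol=0.5):
    """Tolerance-threshold variant of the post-hoc rule: adjust a submitted
    rating only when it differs from the displayed prediction by more
    than `tol` stars; otherwise leave it unchanged."""
    if abs(submitted - shown) <= tol:
        return submitted  # within tolerance: leave the rating as submitted
    return (submitted - alpha * shown) / (1.0 - alpha)
```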
        <p>
          However, based on our empirical explorations with these
various post-hoc de-biasing methods, we have not been able to
achieve any recommendation accuracy improvements by
debiasing consumer ratings with a global rule based on a single
slope parameter (as demonstrated by Fig 2, we also explored other
possible de-biasing slope values in addition to the empirically
observed 0.35 value) or with a user-specific slope-based
debiasing rule. This indicates that, once the biased ratings are
submitted, “reverse-engineering” is a difficult task. More
specifically, while previous research was able to demonstrate that,
in aggregate, there exist clear, measurable anchoring effects, it is
highly likely that each individual anchoring effect (i.e., for a
specific user/item rating) could be highly irregular – the biases
could be user-dependent, item-dependent, context-dependent, and
may have various types of other interaction effects. In fact,
previous research provides some evidence to support such
irregularity and situation-dependency. For example, prior studies
observed symmetric (i.e., both positive and negative, equally
pronounced) anchoring biases when they were aggregated across
many items and asymmetric anchoring biases when they were
tested on one specific item [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>Therefore, an alternative approach to rating de-biasing would
be to eliminate anchoring biases at rating-collection time through
a carefully designed user interface. We discuss experiments with
various interfaces in the next section.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4. APPROACH II: BIAS-AWARE INTERFACE DESIGN</title>
      <p>The bias-aware interface design approach focuses on
proactively preventing anchoring biases from occurring rather
than trying to eliminate them after they have already occurred.
We use a laboratory experiment to investigate various rating
representation forms that may reduce anchoring effects at the
rating collection stage. Besides the recommendation display, all
other elements of the user interface were controlled to be
equivalent across all experimental conditions. Our experiments
explored seven different recommendation displays. Among them,
four display designs were based on two main factors: (i)
information representation (numeric vs. graphical ratings); and (ii)
vagueness of recommendation (precise vs. vague rating values).
Another two displays simulate popular star-rating representations
used in many real-world recommender systems: stars-only and
stars along with a numeric rating. The seventh interface we
explored was a binary design where only “thumbs up (down)” are
displayed for high (low) predictions. Table 2 summarizes the
seven rating representation options (i.e., Binary, Graphic-Precise,
Graphic-Vague, Numeric-Precise, Numeric-Vague, Star-Numeric,
and Star-Only).</p>
    </sec>
    <sec id="sec-9">
      <title>4.1 Experiment Procedure</title>
      <p>
        A database of 100 jokes was used for the study, with the order
of the jokes randomized across participants. The jokes and the
rating data for training the recommendation algorithm were taken
from the Jester Online Joke Recommender System repository, a
database of jokes and preference data maintained by the Univ. of
California, Berkeley (http://eigentaste.berkeley.edu/dataset) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
The well-known item-based collaborative filtering technique was
used to implement a recommender system that estimates users’
preference ratings for the jokes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The study was conducted at
a behavioral research lab at a large North American university,
and participants were recruited from the university’s research
participant pool. In total 287 people completed the study for a
fixed participation fee.
      </p>
      <p>Upon logging in, participants were randomly assigned to one
of the seven treatment groups. Subjects in different treatment
groups saw different displays of predicted ratings. Examples of the
display and number of participants in each treatment group are
provided in Table 2.</p>
      <p>The experimental procedure consisted of three tasks, all of
which were performed using a web-based application on personal
computers with dividers, providing privacy between participants.</p>
      <p>Task 1. In the first task, each participant was asked to
provide his/her preference ratings for 50 jokes randomly selected
from the pool of 100 jokes. Ratings were provided using a scale
from one to five stars with half-star increments, having the
following verbal labels: * = “Hate it”, ** = “Don’t like it”, *** =
“Like it”, **** = “Really like it”, and ***** = “Love it”. For
each joke, we also asked participants to indicate whether they
have heard the joke before. The objective of this joke-rating task
was to capture joke preferences from the participants. Based on
ratings provided in this task, predictions for the remaining unrated
50 jokes were computed.</p>
      <p>Task 2. In the second task, from the remaining unrated 50
jokes, participants were presented with 25 jokes (using 5
recommendation conditions with 5 jokes each) along with a rating
recommendation for each joke and 5 jokes without a
recommendation (as a control condition). The recommendation
conditions are summarized below:
• High-Artificial: randomly generated high recommendation
between 3.5 and 4.5 stars (drawn from a uniform distribution)
• Low-Artificial: randomly generated low recommendation
between 1.5 and 2.5 stars (drawn from a uniform distribution)
• High-Perturbed: algorithmic predictions were perturbed
upward by 1 star
• Low-Perturbed: algorithmic predictions were perturbed
downward by 1 star
• Accurate: actual algorithmic predictions (i.e., not perturbed)
• Control: no recommendation shown</p>
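      <p>The two artificial conditions above draw anchors from uniform distributions. A minimal sketch of that draw, with illustrative names rather than the study’s software, is:</p>

```python
import random

def artificial_recommendation(condition, rng=None):
    """Draw an artificial anchor per the experiment design: uniform in
    [3.5, 4.5] stars for High-Artificial and [1.5, 2.5] stars for
    Low-Artificial."""
    rng = rng or random.Random()
    ranges = {"High-Artificial": (3.5, 4.5), "Low-Artificial": (1.5, 2.5)}
    lo, hi = ranges[condition]
    return rng.uniform(lo, hi)
```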
      <p>We first selected 5 jokes for the High-Perturbed condition and
5 jokes for the Low-Perturbed condition. These 10 jokes were
chosen pseudo-randomly to assure that the manipulated ratings
would fit into the 5-point rating scale. Among the remaining
jokes we randomly selected 15 jokes and assigned them to three
groups: 5 to Accurate, 5 to High-Artificial, and 5 to
Low-Artificial. Five more jokes were added as a control with no predicted
system rating provided. The 25 jokes with recommendations were
randomly ordered and presented on five consecutive webpages
(with 5 displayed on each page). The 5 control jokes were
presented on the subsequent webpage. Participants were asked to
provide their preference ratings for all these 30 jokes on the same
5-star rating scale.</p>
      <p>Task 3. As the third task, participants completed a short
survey that collected demographic and other individual
information for use in the analyses.</p>
    </sec>
    <sec id="sec-10">
      <title>4.2 Analysis and Results</title>
      <p>The Perturbed vs. Artificial within-subjects manipulation
described above represents two different approaches to the study
of recommendation system bias. The Artificial recommendations
provide a view of bias that controls for the value ranges shown,
manipulating some to be high and some low, while not accounting
for individual differences in preferences in providing the
recommendations. The Perturbed recommendations control for
such possible preference differences, allowing a view of
recommendation error effects. We analyze the results from each
of these approaches separately. First, we test different rating
presentations with artificially (i.e., randomly) generated
recommendations (i.e., not based on users’ preferences).</p>
      <sec id="sec-10-1">
        <title>4.2.1 Artificial Recommendations</title>
        <p>Fig 3 presents a plot of the aggregate means of user-submitted
ratings for each of the treatment groups when high and low
artificial recommendations were provided. As can be seen in the
figure, low artificial recommendations pull down users’
preference ratings relative to the control, and high artificial
recommendations tend to increase users’ preference ratings. As
an initial analysis, for each rating display we performed pairwise
t-tests to compare user-submitted ratings after receiving high and
low artificial recommendations. The t-test results are presented in
Table 3.</p>
        <p>Fig 3. Mean user-submitted ratings (bars are one standard error) for the Binary, Graphic-Precise, Graphic-Vague, Numeric-Precise, Numeric-Vague, Star-Numeric, and Star-Only displays after receiving high artificial (High: red dot), low artificial (Low: green triangle), or no recommendations (Control: black square).</p>
        <p>All comparisons between High and Low conditions are
significant across the seven rating representations (one-tailed
p-value &lt; 0.001 for all High vs. Low tests), showing a clear, positive
effect of randomly-generated recommendations on consumers’
preference ratings. All effect sizes are large (Cohen’s d values
range between 0.71 and 1.23). The control condition
demonstrated intermediate preference ratings, showing a
statistically significant difference from both the High and Low
conditions for the majority of the rating display options. This
analysis demonstrates that the anchoring bias of artificial
recommendations exists in all rating displays examined in our
experiment. In other words, we found that none of the seven
rating display options could completely remove the anchoring
biases generated by recommendations.</p>
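<p>The Cohen’s d effect sizes reported above can be computed as in the following sketch; the pooled-standard-deviation variant shown here is one common definition and is our assumption, as are the example values.</p>

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation (one common variant)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical example: identical groups give d = 0; a group shifted
# upward relative to the other gives d above 0.
print(cohens_d([3.0, 3.5, 4.0], [3.0, 3.5, 4.0]))   # 0.0
print(cohens_d([4.0, 3.5, 3.8, 4.2, 3.6], [3.0, 2.5, 2.8, 3.2, 2.6]))
```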
        <p>We further compare the anchoring bias size of different rating
display options. We computed rating differences between High
and Low conditions and performed one-way ANOVA to test the
overall group difference. Our results suggest significant
difference in effect sizes among different rating representations
(F(6, 280) = 2.24, p &lt; 0.05). Since the overall effect was
significant, we next performed regression analysis to explore the
difference in anchoring bias between different rating display
options, while controlling for participant-level factors.</p>
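<p>The one-way ANOVA on High-minus-Low differences can be sketched with scipy. The per-group sample size of 41 below is chosen only to reproduce the reported F(6, 280) degrees of freedom and is otherwise hypothetical, as are the simulated group means.</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated High-minus-Low rating differences for the seven display groups
# (7 groups of 41 observations give between/within df of 6 and 280).
group_means = (0.5, 0.6, 0.4, 0.9, 0.8, 0.85, 0.45)
groups = [rng.normal(m, 0.6, 41) for m in group_means]

f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)
```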
        <p>In our regression analysis, we created a panel from the data.
The repeated-measures design of the experiment, wherein each
participant was exposed to both high and low artificial
recommendations in a random fashion, allows us to model the
aggregate relationship between shown ratings and user’s
submitted ratings while controlling for individual participant
differences. The standard OLS model using robust standard
errors, clustered by participant, and using participant-level
controls represents our model for the analysis.</p>
        <p>UserRating<sub>ij</sub> = b<sub>0</sub> + b<sub>1</sub>(Group<sub>i</sub>) + b<sub>2</sub>(High<sub>ij</sub>) + b<sub>3</sub>(Group<sub>i</sub> × High<sub>ij</sub>) +
b<sub>4</sub>(ShownRatingNoise<sub>ij</sub>) + b<sub>5</sub>(PredictedRating<sub>ij</sub>) + b<sub>6</sub>(Controls) +
u<sub>i</sub> + ε<sub>ij</sub></p>
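<p>The model can be estimated as ordinary least squares with cluster-robust (by participant) standard errors. The numpy sketch of the sandwich estimator below is illustrative only and is not the authors’ code; the simulated panel and coefficient values are assumptions.</p>

```python
import numpy as np

def ols_cluster_se(X, y, clusters):
    """OLS coefficients with cluster-robust (sandwich) standard errors."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    # "Meat": sum over clusters g of the outer product of X_g' e_g
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        mask = clusters == g
        score = X[mask].T @ resid[mask]
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))

# Simulated panel: 20 participants rating 10 jokes each, with a
# participant-level error component u_i on top of idiosyncratic noise.
rng = np.random.default_rng(2)
participants = np.repeat(np.arange(20), 10)
x = rng.normal(size=200)
u = rng.normal(size=20)[participants]
y = 1.0 + 0.8 * x + u + rng.normal(scale=0.3, size=200)
X = np.column_stack([np.ones(200), x])
beta, se = ols_cluster_se(X, y, participants)
print(beta, se)
```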
        <p>
          In the regression equation shown above, UserRating<sub>ij</sub> is the
submitted rating of participant i for joke j, Group<sub>i</sub> is the rating
display option shown to participant i, High<sub>ij</sub> indicates whether the
rating shown to participant i for joke j is a high or a low artificial
recommendation, and ShownRatingNoise<sub>ij</sub> is a derived variable that
captures the deviation between the shown rating for participant i on
joke j and the expected rating value in the corresponding
condition. Specifically, it is computed by subtracting 4.0
from the shown rating in the high artificial condition or by
subtracting 2.0 from the shown rating in the low artificial
condition. PredictedRating<sub>ij</sub> is the predicted recommendation star
rating for participant i on joke j, and Controls is a vector of joke-
and consumer-related variables for participant i. The controls
included in the model were the joke’s funniness (average joke
rating in the Jester dataset, continuous between 0 and 5),
participant gender (binary), age (integer), whether the participant
is a native speaker of English (yes/no binary), whether they thought
the recommendations in the study were accurate (five-point interval
scale), whether they thought the recommendations were useful
(five-point interval scale), and their self-reported numeracy level,
reflecting participants’ beliefs about their mathematical skills as a
perceived cognitive ability, measured with a four-item scale developed
and validated by prior research [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] (continuous between 4 and 24).
The latter information was collected in order to check for possible
relationships between an individual’s subjective numeracy
capabilities and their susceptibility to anchoring biases under
numeric vs. graphical rating displays. As the study utilized a
repeated-measures design with a balanced number of observations
per participant, the composite error term (u<sub>i</sub> + ε<sub>ij</sub>) includes the
individual participant effect u<sub>i</sub> and the standard disturbance
term ε<sub>ij</sub> to control for participant-level heterogeneity.
        </p>
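<p>As a small illustration of the ShownRatingNoise definition above (a hypothetical helper, not the authors’ code):</p>

```python
def shown_rating_noise(shown_rating, is_high):
    """Deviation of a shown rating from its condition's expected value:
    4.0 in the high artificial condition, 2.0 in the low one."""
    expected = 4.0 if is_high else 2.0
    return shown_rating - expected

print(shown_rating_noise(4.5, True))   # high condition: 0.5
print(shown_rating_noise(1.5, False))  # low condition: -0.5
```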
        <p>
          The Numeric-Precise rating display condition was chosen as
the baseline rating representation against which to compare the
other six options, for two reasons. First, it is a
popular rating display used in many real-world recommender
systems of large e-commerce websites such as Amazon, eBay, and
Netflix. Second, the Numeric-Precise rating display option was
used in previous experiments in the literature [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and was found to
lead to substantial anchoring biases in consumers’ preference
ratings. Therefore, in our analysis we compare Numeric-Precise
with the other rating display options to examine whether
alternative representations can reduce the observed biases.
        </p>
        <p>We ran three regression models with high artificial only, low
artificial only, and both high and low artificial recommendations.
Note when only high or low recommendations were included for
analysis, the model omitted the High variable and its related
interaction terms. Table 4 presents the estimated coefficients and
standard errors for the three regression models. All models
utilized robust standard error estimates. The regression analysis
controls for both participant and joke level factors as well as the
participant’s predicted preferences for the product being
recommended.</p>
        <p>Our analysis found randomly-generated recommendations
displayed in Numeric-Precise format can substantially affect
consumers’ preference ratings, as indicated by significant
coefficients for Anchoring and ShownRatingNoise in all three
models. More importantly, we found significant negative
interaction effects between multiple rating display options and
anchoring (Model 3). The results clearly indicate that there are
significant differences in anchoring biases between
NumericPrecise and other rating display options. Specifically, we
observed that groups including Binary, Graphic-Precise,
GraphicValue, and Star-Only, when compared to Numeric-Precise, can
generate much lower biases in consumers’ preference ratings. All
the corresponding interaction terms have negative coefficients
with p-values smaller than 0.05. On the other hand, the
interaction terms for Numeric-Vague and Star-Numeric were not
significant, suggesting that these two display options lead to
similar levels of anchoring biases as Numeric-Precise.</p>
        <p>Overall, the Model 3 results suggest that, among all seven
experimental rating display conditions, when randomly-assigned
recommendations are presented in any non-numeric format
(including Binary, Graphic-Precise, Graphic-Vague, Star-Only),
they will generate much smaller anchoring biases compared to the
same recommendations displayed in numeric formats such as
Numeric-Precise, Numeric-Vague and Star-Numeric. In other
words, the information representation of recommendations (e.g.,
numeric vs. non-numeric) largely determines the size of bias in
consumers’ preferences. Introducing vagueness to
recommendations did not seem to reduce the anchoring bias when
compared to the Numeric-Precise baseline (i.e., interaction
between Numeric-Vague and anchoring is insignificant).</p>
        <p>In a follow-up regression analysis (Table 5), we focused on
four rating displays (i.e., Numeric-Precise, Numeric-Vague,
Graphic-Precise, and Graphic-Vague) and similarly found the
interaction between information presentation and anchoring (i.e.,
Numeric × Anchoring) was significant while the interaction
between vagueness and anchoring (i.e., Precise × Anchoring) was
not significant. This further confirms that the anchoring bias can
be reduced by presenting recommendations in graphical forms
rather than numeric forms. Anchoring bias, however, cannot be
reduced by presenting the recommendations as vague rating
ranges (as opposed to precise values).</p>
        <p>In addition, Model 1 focuses on high artificial
recommendations (Table 4) and demonstrates significantly
smaller anchoring biases for Binary, Graphic-Vague,
NumericVague and Star-Only displays, when compared to the
NumericPrecise display as the baseline. Model 2 focuses on low artificial
recommendations and suggests that Graphic-Precise displays
generated smaller biases compared to the baseline when
recommendations were low. Therefore, another finding from
Models 1 and 2 is that the “bias-reducing” effects of many rating
display options can be highly asymmetric and depend on
contextual factors such as the actual value of the recommendation.</p>
        <p>Among the secondary factors, predicted consumer
preferences, joke funniness, and perceived accuracy of
recommendations all had consistently significant effects across all
models. Therefore, controlling for these factors in the regression
model was warranted.</p>
      </sec>
      <sec id="sec-10-2">
        <title>4.2.2 Perturbed Recommendations</title>
        <p>As an extension to a more realistic setting and as a robustness
check, we next examine whether anchoring biases generated by
perturbations in real recommendations from an actual
recommender system can be eliminated by certain rating display
options. Recall that participants received recommendations that
were perturbed either upward (High-Perturbed) or downward
(Low-Perturbed) by 1 star from the actual predicted ratings. As a
control, each participant also received recommendations without
perturbations (Accurate). Consumers’ submitted ratings for the
jokes were adjusted for the predicted ratings in order to obtain a
response variable on a comparable scale across subjects. Thus,
the main response variable is the rating drift, which we define as:</p>
        <p>RatingDrift = UserRating – PredictedRating</p>
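<p>For concreteness, the rating drift definition above corresponds to the following hypothetical helper:</p>

```python
def rating_drift(user_rating, predicted_rating):
    """Drift of a submitted rating away from the system's predicted rating."""
    return user_rating - predicted_rating

# A submitted rating above the prediction yields a positive drift:
print(rating_drift(4.0, 3.2))
```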
        <p>Fig 4 is a plot of the aggregate means of rating drift for each
treatment group when recommendations were perturbed to be
higher or lower or received no perturbation. As can be seen, the
negative perturbations (Low, green triangle) lead to negative
rating drifts and positive perturbations (High, red dot) lead to
positive drifts in user ratings, while the accurate recommendations
with no perturbation (Accurate, black square) lead to drifts around
zero. For each rating display, we performed pairwise t-tests to
compare user-submitted ratings after receiving high and low
perturbed recommendations. The t-test results are presented in
Table 6.</p>
        <p>[Fig 4 chart omitted: average rating drift, with one-standard-error bars, for each rating display (Binary, Graphic-Precise, Graphic-Vague, Numeric-Precise, Numeric-Vague, Star-Number, Star-Only) under the High (H), Accurate (A), and Low (L) conditions.]
Fig 4. Mean and standard deviation of user rating drift after
receiving high perturbed (High: red dot), low perturbed
(Low: green triangle), and non-perturbed recommendations
(Accurate: black square).</p>
        <p>All mean rating drift comparisons between High and Low
perturbed conditions are significant for all rating display options
(one-tailed p-value &lt; 0.001 for all High vs. Low tests), showing a
clear and positive anchoring bias of system recommendations on
consumers’ rating drift. Such anchoring biases exist in both the
High and Low perturbed conditions for the majority of the rating
display options. The results clearly demonstrate that the
anchoring effect of perturbed recommendations still exists in all
rating display options investigated in our experiment. Hence,
similar to the artificial groups, we found that none of the seven
rating display options could completely remove the anchoring
biases generated by perturbed real recommendations.</p>
        <p>We next performed regression analysis to compare the size of
anchoring bias across different rating display options, while
controlling for participant-level factors. In our regression
analysis, we created a panel from the data as each participant was
exposed to both high and low perturbed recommendations in a
random fashion. The standard OLS model using robust standard
errors, clustered by participant, and participant-level controls
represents our model for the analysis.</p>
        <p>RatingDrift<sub>ij</sub> = b<sub>0</sub> + b<sub>1</sub>(Group<sub>i</sub>) + b<sub>2</sub>(High<sub>ij</sub>) + b<sub>3</sub>(Group<sub>i</sub> × High<sub>ij</sub>) +
b<sub>4</sub>(PredictedRating<sub>ij</sub>) + b<sub>5</sub>(Controls) + u<sub>i</sub> + ε<sub>ij</sub></p>
        <p>In the above regression model, RatingDriftij is the difference
between submitted rating and predicted rating for participant i on
joke j, Groupi is the rating display option shown to participant i,
Highij indicates whether the recommendation for participant i on
joke j was perturbed upward or downward. Controls is the same
vector of joke and consumer-related variables that was used in the
previous regression analysis for artificial recommendations.</p>
        <p>The regression model used ordinary least squares (OLS)
estimation and a random effect to control for participant-level
heterogeneity. The Numeric-Precise rating display condition was
again chosen to be the baseline rating representation to compare
with the other six options. Table 7 summarizes the regression
analysis of perturbed recommendations.</p>
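<p>Controlling for participant-level heterogeneity, as in the composite error term above, can be approximated by within-participant centering that absorbs the u<sub>i</sub> component. The sketch below is a fixed-effects-style approximation of our own, not the random-effects estimator actually used, and the data are hypothetical.</p>

```python
import numpy as np

def demean_by_participant(values, participants):
    """Subtract each participant's own mean, absorbing the u_i term."""
    values = np.asarray(values, float)
    out = np.empty_like(values)
    for p in np.unique(participants):
        mask = participants == p
        out[mask] = values[mask] - values[mask].mean()
    return out

drift = np.array([0.5, 0.9, -0.2, -0.6, 0.3, 0.1])
participants = np.array([1, 1, 1, 2, 2, 2])
print(demean_by_participant(drift, participants))
```

After centering, each participant's residual drifts sum to zero, so remaining variation is within-participant.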
        <p>Consistent with what we found in the artificial conditions,
interaction terms between anchoring and some non-numeric
displays including Binary and Graphic-Vague were significantly
negative. Thus, when recommendations were displayed in Binary
and Graphic-Vague formats, they generated much smaller rating
drifts from consumer’s actual preference, when compared to the
baseline Numeric-Precise display.</p>
        <p>Similar to Table 5, we also performed a 2×2 analysis on the
two main dimensions: representation (numeric vs. graphic) and
vagueness (precise vs. vague) of the displayed recommendations.
Our results in Table 8 confirm that presenting recommendations
in numeric format can lead to much larger ratings shifts in
consumer’s preference ratings than presenting the same
recommendations in graphical format. The vagueness of
recommendation value, however, does not have significant
influence on size of anchoring bias.</p>
        <p>Overall, we observed that the real recommendations presented
graphically can significantly lead to lower anchoring biases than
real recommendations displayed in numeric forms (either as a
precise number or as a numeric range). In addition, displaying
real recommendations in binary format leads to much lower
anchoring biases compared to recommendations in numeric forms
(both numeric-precise and numeric-vague). Further, displaying
real recommendations as a vague numeric range could not
significantly reduce anchoring biases when compared to the
benchmark approach of showing a precise value.</p>
      </sec>
      <sec id="sec-10-3">
        <title>4.2.3 Discussion</title>
        <p>Using several regression analyses and controlling for various
participant-level factors, we found that none of the seven rating
display options completely removed the anchoring biases
generated by recommendations. However, we observed that some
rating representations were more advantageous than others. For
example, we found that graphical recommendations can lead to
significantly lower anchoring biases than equivalent numeric
forms (either as a precise number or a numeric range). In
addition, displaying recommendations in binary format leads to
lower anchoring biases compared to recommendations in numeric
forms.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>5. CONCLUSIONS</title>
      <p>This paper focuses on the problem of “de-biasing” users’
submitted preference ratings and proposes two possible
approaches to remove anchoring biases from self-reported ratings.</p>
      <p>The first proposed approach uses post-hoc adjustment rules to
systematically sanitize user-submitted ratings that are known to be
biased. We ran experiments under a variety of settings and
explored both global adjustment rules and user-specific
adjustment rules. Our investigation explicitly demonstrates the
advantage of unbiased ratings over biased ratings on
recommender systems’ predictive performance. We also
empirically show that post-hoc de-biasing of consumer preference
ratings is a difficult task. Removing biases from submitted ratings
using a global rule or user-specific rule is problematic, most likely
due to the fact that the anchoring effects can manifest themselves
very differently for different users and items. This further
emphasizes the need to investigate more sophisticated post-hoc
de-biasing techniques and, even more importantly, the need to
proactively prevent anchoring biases in recommender systems
during rating collection.</p>
      <p>Therefore, the second proposed approach is a
user-interface-based solution that tries to minimize anchoring biases at rating
collection time. We provide several ideas for recommender
systems interface design and demonstrate that using alternative
representations can reduce the anchoring biases in consumer
preference ratings. In our laboratory experiment, we were not
able to completely avoid anchoring biases with any of the
carefully designed user interfaces tested. However, we
demonstrate that some interfaces are more advantageous for
minimizing anchoring biases. For example, using graphic, binary,
and star-only rating displays can help reduce anchoring biases
when compared to using the popular numerical forms.</p>
      <p>In future research, another possible de-biasing approach might
be through consumer education, i.e., to make consumers more
cognizant of the potential decision-making biases introduced
through online recommendations. This constitutes an interesting
direction for future explorations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Adomavicius</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bockstedt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curley</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>"Do Recommender Systems Manipulate Consumer Preferences? A Study of Anchoring Effects,"</article-title>
          <source>Information Systems Research</source>
          ,
          <volume>24</volume>
          (
          <issue>4</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Adomavicius</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bockstedt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curley</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>"Effects of Online Recommendations on Consumers' Willingness to Pay,"</article-title>
          <source>Conference on Information Systems and Technology</source>
          . Phoenix, AZ.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Koren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>"Improved Neighborhood-Based Collaborative Filtering,"</article-title>
          <source>KDDCup'07</source>
          , San Jose, CA, USA,
          <fpage>7</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lanning</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>"The Netflix Prize,"</article-title>
          <source>KDD Cup and Workshop</source>
          , www.netflixprize.com.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Cosley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lam</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albert</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konstan</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Riedl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>"Is Seeing Believing? How Recommender Interfaces Affect Users' Opinions,"</article-title>
          <source>CHI 2003 Conference</source>
          , Fort Lauderdale, FL.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Fagerlin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zikmund-Fisher</surname>
            ,
            <given-names>B.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ubel</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jankovic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Derry</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>"Measuring Numeracy without a Math Test: Development of the Subjective Numeracy Scale,"</article-title>
          <source>Medical Decision Making</source>
          ,
          <volume>27</volume>
          ,
          <fpage>672</fpage>
          -
          <lpage>680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Konstan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maltz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herlocker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gordon</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Riedl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>"Grouplens: Applying Collaborative Filtering to Usenet News,"</article-title>
          <source>Communications of the ACM</source>
          ,
          <volume>40</volume>
          ,
          <fpage>77</fpage>
          -
          <lpage>87</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Koren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Volinsky</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>"Matrix Factorization Techniques for Recommender Systems,"</article-title>
          <source>IEEE Computer</source>
          ,
          <volume>42</volume>
          ,
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Lemire</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>"Scale and Translation Invariant Collaborative Filtering Systems,"</article-title>
          <source>Information Retrieval</source>
          ,
          <volume>8</volume>
          (
          <issue>1</issue>
          ),
          <fpage>129</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Sarwar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konstan</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Riedl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>"Item-Based Collaborative Filtering Recommendation Algorithms,"</article-title>
          <source>Int'l WWW Conference</source>
          , Hong Kong,
          <fpage>285</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Sarwar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konstan</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Riedl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>"Item-Based Collaborative Filtering Recommendation Algorithms,"</article-title>
          <source>The 10th International WWW Conference</source>
          , Hong Kong,
          ,
          <fpage>285</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>