Iterative Smoothing Technique for Improving Stability of Recommender Systems

Gediminas Adomavicius, University of Minnesota, gedas@umn.edu
Jingjing Zhang, Indiana University, jjzhang@indiana.edu

ABSTRACT
We focus on the measure of recommendation stability, which reflects the consistency of recommender system predictions. Stability is a desired property of recommendation algorithms and has important implications on users' trust and acceptance of recommendations. Prior research has reported that some popular recommendation algorithms can suffer from a high degree of instability. In this study we propose a scalable, general-purpose iterative smoothing approach that can be used in conjunction with different traditional recommendation algorithms to improve their stability. Our experimental results on real-world rating data demonstrate that the proposed approach can achieve substantially higher stability as compared to the original recommendation algorithms. Importantly, the proposed approach not only does not sacrifice predictive accuracy in order to improve recommendation stability, but is actually able to provide additional accuracy improvements at the same time.

Copyright is held by the author/owner(s). Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with ACM RecSys 2012. September 9, 2012, Dublin, Ireland.

1. INTRODUCTION
Recommender systems represent technologies that assist users in finding a set of interesting or relevant items [1]. In order to provide good recommendations, recommender systems employ users' feedback on consumed items. This input can include explicitly provided feedback in the form of ratings or tags, as well as feedback that can be implicitly inferred by monitoring users' behavior, such as browsing, linking, or buying patterns. The most common approach to modeling users' preferences for items is via numeric ratings. The recommendation algorithm then analyzes patterns of users' past ratings and predicts users' preference ratings for new, not yet consumed items. Once ratings for the new items are estimated, the item(s) with the highest estimated rating(s) can be recommended to the user.

In the recommender systems literature, evaluating the performance of recommendation algorithms has always been a key issue, and recommendation accuracy has been the major focus in developing evaluation metrics [11,23]. As a result, much of the research in the recommender systems area has focused on proposing new techniques to enhance the accuracy of recommendation algorithms in predicting what users will like, as exemplified by the recent $1M Netflix Prize competition. Prediction accuracy metrics typically compare the rating values estimated by a recommendation algorithm against the actual rating values and reflect the closeness of the system's predictions to users' true ratings. In addition to recommendation accuracy, researchers have proposed a number of alternative types of measures, including recommendation coverage, diversity, novelty, serendipity, and several others, to evaluate the performance of recommender systems [11,23]. Of special interest to us is the recently introduced measure of recommendation stability [2], which reflects the level of consistency among the predictions made by the system.

According to the definition, stability is the consistent agreement of predictions made on the same items by the same algorithm, when any new incoming ratings are in complete agreement with the system's prior estimations [2]. As has been discussed in prior work, stability is an important and desired property of recommender systems, and has a number of potential implications related to users' trust and acceptance of such systems [2].

While providing stable and consistent recommendations is important in many contexts, prior research has demonstrated that some popular collaborative filtering recommendation algorithms can suffer from a high degree of instability [2]. This is particularly true for the widely used item- and user-based nearest-neighbor collaborative filtering approaches. It has also been shown that stability does not necessarily correlate with predictive accuracy [2], i.e., different recommendation algorithms can exhibit different levels of stability, even though they may have similar prediction accuracy. Thus, maximizing accuracy may not necessarily help to improve stability, and vice versa. For instance, a simple heuristic that predicts any unknown user rating as the average of all known ratings of that user is perfectly stable [2]; however, in most real-world settings this heuristic is outperformed by more sophisticated recommendation algorithms in terms of predictive accuracy. Therefore, the main objective of this study is to develop an approach that can improve the stability of recommendation algorithms without sacrificing their accuracy.

In this paper, we propose a general iterative smoothing approach to improve the stability of any given recommendation technique. The approach serves as a meta-algorithm, i.e., it can be used in conjunction with any traditional recommendation technique. Accordingly, the paper evaluates the performance of the proposed approach in conjunction with a number of popular and widely-used recommendation algorithms in terms of their stability as well as accuracy on several real-world movie rating datasets. The results show that this meta-algorithmic approach provides substantial improvements in recommendation stability as compared to the original recommendation algorithms, while providing some additional accuracy benefits as well.

2. RELATED WORK
Based on how unknown ratings are predicted, recommendation techniques can be classified into three general categories: content-based, collaborative filtering, and hybrid [1,3]. Among different recommendation approaches, collaborative filtering techniques have been most widely used, largely because they are domain independent, require minimal, if any, information about user and item features, yet can still achieve accurate predictions [13,19].

In a typical setting of collaborative filtering recommender systems, users' preferences for items are modeled via numeric ratings. Thus, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user, and this estimation is usually based on the other available ratings given by this and/or other users. More formally, given a set of users U and a set of items I, the entire user-item space is denoted as S = U × I. Let R(u,i) represent the rating that user u gave to item i, where R(u,i) is typically known only for a limited subset of all possible (u,i) pairs. Let D be the set of known ratings, and S\D be the set of unknown ratings. Therefore, the recommendation task is to estimate the unknown R(u,i) values for all (u,i) ∈ S\D pairs, given U, I, and the known R(u,i) values for (u,i) ∈ D.

As mentioned earlier, predictive accuracy has been a major focus in the recommender systems literature. One of the most widely used predictive accuracy metrics for recommender systems is root mean squared error (RMSE), which we will use in this paper to report the recommendation accuracy results.

Figure 1. Illustration of stability computation, adapted from [2].
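To make the preceding notation concrete, the known-ratings set D and the RMSE metric can be sketched in Python as follows (an illustrative sketch; the dictionary representation, the toy ratings, and the global-mean predictor are our own choices, not from the paper):

```python
import math

# Known ratings D: a map from (user, item) pairs to rating values.
# The full rating space S = U x I; pairs absent from D form S\D.
D = {("u1", "i1"): 4.0, ("u1", "i2"): 3.0, ("u2", "i1"): 5.0}

def rmse(predictions, actuals):
    """Root mean squared error over the (u,i) pairs present in both maps."""
    common = predictions.keys() & actuals.keys()
    sq = [(predictions[p] - actuals[p]) ** 2 for p in common]
    return math.sqrt(sum(sq) / len(sq))

# Example: evaluate a predictor that guesses the mean of the known ratings.
mean = sum(D.values()) / len(D)
preds = {pair: mean for pair in D}
print(round(rmse(preds, D), 4))  # prints 0.8165
```

Any of the recommendation algorithms discussed later would play the role of the mean predictor here, producing estimates for the unknown pairs in S\D instead.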
However, there is a growing understanding that good recommendation accuracy alone may not give users a satisfying experience when using recommender systems, and that some other (complementary) measures are also important in evaluating the effectiveness of the system [23]. Our focus in this paper is one of these important complementary measures, recommendation stability [2].

Recommendation stability measures the inherent consistency among the different predictions made by the system. Consider an example where the system makes predictions for two movies, i1 and i2, for user u. Let's denote the two rating predictions as R*(u,i1) and R*(u,i2). Let's assume that prediction R*(u,i1) is precisely accurate and, after user u consumes item i1, it gets added to the system as part of the known ratings, i.e., R(u,i1) = R*(u,i1). The recommendation algorithm re-computes all other predictions in light of the new data. Would this change the value of the other prediction, R*(u,i2), and to what extent, even though the newly incoming rating data was exactly as predicted by the system? In other words, two predictions R*(u,i1) and R*(u,i2) of the same recommender system can be viewed as "inconsistent" with each other if adding one of them to the training data for the system changes the other prediction. As discussed in [2], the degree of the change in predictions reflects the (in)stability of the recommendation algorithm.

Specifically, the stability of a recommender system is defined as the extent to which the system's predictions for the same items stay the same/similar (i.e., are stable) when any and all new incoming ratings submitted to the system are in complete agreement with the system's prior predictions. Based on this idea, a two-phase approach (illustrated in Figure 1) for computing the stability of a recommendation algorithm has been introduced in prior literature [2]. In Phase 1, given a set of known ratings, a predictive model is built, and predictions for all unknown ratings are made. Then, a random subset of the system's predictions is added to the original set of known ratings as hypothetical new incoming ratings. In Phase 2, based on the expanded training data, a second predictive model is built using the same recommendation technique, and predictions on the remaining unknown ratings are made. Stability is then measured by comparing the two sets of predictions to compute their root mean squared difference, which is called root mean squared shift (RMSS) and is computed in a similar fashion as RMSE:

RMSS = sqrt( (1 / |P1 ∩ P2|) · Σ(u,i)∈P1∩P2 (P2(u,i) − P1(u,i))² ),

where P1 and P2 are the predictions made in Phase 1 and Phase 2, respectively, and the intersection is taken over the user-item pairs predicted in both phases. That is, RMSS captures the shift in predictions made by the same recommendation algorithm after receiving new ratings that are in complete agreement with the algorithm's own prior estimations.

Providing consistent predictions has important implications on users' trust and acceptance of recommendations. In the consumer psychology literature, it has been shown that online advice-giving agents with greater fluctuations in past opinions are considered less informative than those with a more uniform pattern of opinions [9], and advice inconsistent with past recommendations is considered less helpful than consistent advice [7,24]. Inconsistent recommendations may be discredited by users and, as a result, the decrease in users' trust will further reduce their perceptions of the recommender system's competence [12,17]. In contexts where consistency is vital, the instability of a recommender system is likely to have a negative impact on users' acceptance and, thus, harm the success of the system.

Prior research has provided a comprehensive investigation into the stability of popular recommendation algorithms [2] and showed that some widely used algorithms can be highly unstable. For example, it has been shown that RMSS for the user-based collaborative filtering approach can be as high as 0.47 on the Netflix movie rating data, meaning that on average every predicted rating will shift by 0.47 stars (i.e., by about a half-star) after adding to the training data some new ratings that are identical to the system's current predictions [2]. This is a very significant shift in prediction, considering that the length of the rating scale for the dataset is only 4 stars (i.e., ratings go from 1 to 5). Thus, techniques for stability improvement are needed. In this paper, we propose a general-purpose meta-algorithmic approach that can be used to improve the stability of traditional recommendation algorithms.

3. ITERATIVE SMOOTHING APPROACH
3.1 General Idea
High instability results from predictions that are inconsistent with each other. We propose an iterative smoothing approach, which involves multiple iterations for repeatedly and collectively adjusting the rating predictions of a recommendation algorithm based on its other predictions and, thus, is explicitly aimed at improving the consistency of predicted ratings. The key idea of iterative smoothing is that the rating predictions computed during the current iteration will be fed back into the data to predict other ratings in subsequent iterations.

Figure 2 provides an overview of the proposed iterative smoothing algorithm, and Figure 3 gives a high-level illustration of the overall process. Given rating space S and training set D where ratings are known, predictions on the unknown ratings S\D are first estimated using some standard recommendation algorithm T. These predictions are denoted as P0.
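The two-phase stability computation described above can be sketched as follows (an illustrative sketch; the overall-mean predictor standing in for an arbitrary algorithm T, and the toy data layout, are our own assumptions):

```python
import math
import random

def train_and_predict(known, unknown_pairs):
    """Stand-in for an arbitrary recommendation algorithm T: predicts
    every unknown rating as the overall mean of the known ratings."""
    mean = sum(known.values()) / len(known)
    return {pair: mean for pair in unknown_pairs}

def rmss(known, unknown_pairs, added_fraction=0.5, seed=0):
    """Two-phase stability computation: root mean squared shift (RMSS)."""
    # Phase 1: train on the known ratings, predict all unknown ratings.
    p1 = train_and_predict(known, unknown_pairs)
    # Add a random subset of the predictions as hypothetical new ratings.
    rng = random.Random(seed)
    added = rng.sample(sorted(p1), int(len(p1) * added_fraction))
    expanded = dict(known)
    expanded.update((pair, p1[pair]) for pair in added)
    remaining = [pair for pair in unknown_pairs if pair not in expanded]
    # Phase 2: retrain on the expanded data, re-predict the remaining pairs.
    p2 = train_and_predict(expanded, remaining)
    shifts = [(p2[pair] - p1[pair]) ** 2 for pair in remaining]
    return math.sqrt(sum(shifts) / len(shifts))
```

For the mean-based stand-in, the shift is zero by construction, which mirrors the observation above that simple averaging heuristics are perfectly stable; plugging a less stable algorithm into `train_and_predict` yields a positive RMSS.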
The procedure of iterative smoothing then starts to iteratively adjust the estimations for each rating in S\D based on all other ratings in the rating space S (i.e., both known as well as predicted), in order to proactively improve the consistency between different predicted ratings.

Figure 3. Illustration of the general iterative smoothing process.

More specifically, during the k-th iteration (for any k = 1, 2, …), for each unknown user-item pair (u,i) ∈ S\D, a model f(k,u,i) is built based on training dataset D(k,u,i). This dataset is constructed to contain all known ratings combined with all predictions on user-item pairs in S\D computed in the previous (k−1) iteration, except for the prediction on (u,i) itself, as indicated in Figure 2. As the result of the k-th iteration, each model f(k,u,i) produces a new prediction for (u,i) that is stored as Pk(u,i), i.e.:

Pk(u,i) = R(u,i), if (u,i) ∈ D;
Pk(u,i) = f(k,u,i)(u,i), if (u,i) ∈ S\D.

Here R(u,i) represents the known rating that user u gave item i, and f(k,u,i) is the prediction model built based on D(k,u,i). Figure 4 illustrates the process within each iteration of the iterative smoothing approach. The set of predictions made in the current iteration is compared with the predictions made in the previous iteration to compute the deviation between the two sets (measured as a root mean squared difference). The iterative smoothing process stops either after a fixed, pre-determined number of iterations or when the predictions on unknown ratings do not change.

Figure 4. Illustration of one smoothing iteration.

By definition, in each iteration, a separate model is built for every user-item pair with an unknown rating. Thus, in total, |S\D|·K models need to be constructed over the course of K iterations, each using |S|−1 ratings as a training dataset. If t(x) is the time needed to build a predictive model on a data sample of size x using recommendation algorithm T, then the time complexity of the iterative smoothing approach is O(|S\D|·K·t(|S|)).

Iterative Smoothing Algorithm:
Inputs: known ratings data D, # of iterations K, algorithm T
Process:
1. Build model f0 on known ratings D using some standard recommendation algorithm T, i.e., f0 ← T(D)
2. Apply model f0 to compute predictions P0 for unknown ratings S\D, i.e., P0(u,i) = f0(u,i) for (u,i) ∈ S\D
3. For each iteration k ∈ {1, …, K}
   For each unknown rating pair (u,i) ∈ S\D
   a. Construct dataset D(k,u,i) by including all known ratings D and all predicted ratings Pk-1 from the previous iteration, except for rating Pk-1(u,i), i.e., D(k,u,i) = D ∪ Pk-1 \ {Pk-1(u,i)}
   b. Build model f(k,u,i) on dataset D(k,u,i) using T, i.e., f(k,u,i) ← T(D(k,u,i))
   c. Make a prediction on (u,i) and store it in Pk, i.e., Pk(u,i) = f(k,u,i)(u,i)
4. Output predictions made in the final iteration, PK
Output: PK

Figure 2. Iterative smoothing approach.

3.2 Scalable Iterative Smoothing
We propose a variation of the iterative smoothing approach that substantially simplifies the original approach and significantly reduces its computational requirements. Figure 5 provides an overview of the proposed scalable iterative smoothing algorithm. Similarly to the original iterative smoothing approach, the proposed scalable version also involves multiple iterations for repeatedly adjusting the estimations for each unknown rating. However, in each iteration, instead of building one model for each unknown rating, only one single model is built upon all known ratings and the predicted ratings from the previous iteration.
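For concreteness, the original per-rating procedure of Figure 2 can be sketched as follows (an illustrative sketch; `T` is any trainable regressor, here stubbed with a mean predictor of our own choosing):

```python
def iterative_smoothing(D, unknown_pairs, T, K):
    """Original (per-rating) iterative smoothing, as in Figure 2.

    D: dict mapping (user, item) -> known rating.
    unknown_pairs: the pairs in S\\D whose ratings must be estimated.
    T: function mapping a ratings dict to a model; a model maps a
       (user, item) pair to a predicted rating.
    K: number of smoothing iterations.
    """
    f0 = T(D)                                # step 1: initial model
    P = {p: f0(p) for p in unknown_pairs}    # step 2: predictions P0
    for k in range(1, K + 1):                # step 3: smoothing iterations
        P_new = {}
        for pair in unknown_pairs:
            # D_{k,u,i}: all known ratings plus every previous-iteration
            # prediction, except the prediction for this pair.
            D_kui = {**D, **P}
            del D_kui[pair]
            f_kui = T(D_kui)                 # one model per unknown rating
            P_new[pair] = f_kui(pair)
        P = P_new                            # predictions P_k
    return P                                 # step 4: output P_K

# A stand-in algorithm T: predict the mean of the training ratings.
def mean_T(ratings):
    mean = sum(ratings.values()) / len(ratings)
    return lambda pair: mean
```

The scalable variant of Figure 5 replaces the inner loop with a single model per iteration, built on D ∪ Pk-1, reducing K·|S\D| model builds to K.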
In other words, the complexity of the original iterative smoothing algorithm is proportional to the number of unknown ratings in the rating space. This computational requirement is likely to be prohibitively expensive for many real-world scenarios, i.e., anytime the user-item rating space is large and the rating data is sparse. For example, the Movielens 100K dataset is a publicly available movie rating dataset [10], which is considered to be relatively small. This dataset contains 100,000 movie ratings from 943 users on 1682 movies. Thus, the total number of possible ratings is about 1.6M (i.e., 1682 × 943), of which 1.5M are unknown. If we applied the original iterative smoothing algorithm to the Movielens 100K dataset, in total 1.5 million predictive models (each built on 1.6 million ratings) would need to be constructed within each iteration. Therefore, applying the full iterative smoothing approach may not be feasible even for datasets that are not very large. The scalable variation described in this section overcomes this complexity issue and offers significant scalability improvements while retaining most of the stability benefits (as will be shown in the Results section).

Specifically, during the k-th iteration (for any k = 1, 2, …), a single model fk is built based on training dataset Dk, which contains all known ratings D combined with all predictions Pk-1 on user-item pairs in S\D computed in the previous iteration. Pk(u,i) denotes the predicted rating for (u,i) in the k-th iteration by this collective inference approach, defined as follows:

Pk(u,i) = R(u,i), if (u,i) ∈ D;
Pk(u,i) = fk(u,i), if (u,i) ∈ S\D.

Here R(u,i) represents the known rating that user u gave item i. As the result of the k-th iteration, for each (u,i) ∈ S\D, the model fk produces a new prediction for (u,i) that is stored as Pk(u,i). Similarly to the original approach, the new predictions are compared with the predictions made in the previous iteration, and the procedure ends either after a fixed number of iterations or when the predictions on unknown ratings converge (i.e., do not change).

The key difference between this simplified variation and the original iterative smoothing algorithm is the number of predictive models built within each iteration k. The original algorithm builds a separate model for every unknown rating in the rating space, in order to properly adjust the predicted rating Pk(u,i) using all other ratings from the previous iteration, i.e., all ratings from D as well as all ratings Pk-1(u',i') where (u',i') ≠ (u,i). In contrast, the simplified algorithm builds only one predictive model in each iteration, based on the entire rating matrix. In other words, predicted rating Pk(u,i) is adjusted using all ratings from the previous iteration, i.e., all ratings from D as well as from Pk-1, including Pk-1(u,i). Thus, in the simplified algorithm, for any given rating prediction P1(u,i) in the first iteration, the predictive model is built on rating data (i.e., D1) that differs from the rating data used in the original algorithm by only one additional rating (i.e., D1\{P0(u,i)}). Because the influence of one additional rating is often subtle, especially when the entire rating space is large (i.e., in settings with large numbers of users and items), the single overall model built in the simplified algorithm should produce outcomes similar to the ones produced by the individual models built in the original algorithm, especially in the first iteration. While the difference between the original and simplified versions of iterative smoothing may slowly increase as the number of iterations grows, the simplified approach still provides significant performance improvements (both in stability and accuracy), as demonstrated by the experimental results later in the paper.

Moreover, the runtime complexity of the simplified algorithm is much lower, making it much more practical from the scalability perspective. In particular, as only one overall model is built on all available ratings (i.e., a dataset of size |S|) within each iteration, in total K models are constructed over the course of K iterations. Thus, the time complexity of the simplified variation is O(K·t(|S|)). Comparing this to the complexity of the original algorithm, O(|S\D|·K·t(|S|)), the scalable heuristic offers huge computational improvements (i.e., by roughly a factor of |S\D|). In this paper, we use scalable iterative smoothing in our experiments.

Scalable Iterative Smoothing Algorithm:
Inputs: known ratings data D, # of iterations K, algorithm T
Process:
1. Build model f0 on known ratings D using some standard recommendation algorithm T, i.e., f0 ← T(D)
2. Apply model f0 to compute predictions P0 for unknown ratings S\D, i.e., P0(u,i) = f0(u,i) for (u,i) ∈ S\D
3. For each iteration k ∈ {1, …, K}
   a. Construct dataset Dk by including all known ratings D and all predicted ratings Pk-1 from the previous iteration, i.e., Dk = D ∪ Pk-1
   b. Build model fk on dataset Dk using T, i.e., fk ← T(Dk)
   c. For each unknown rating pair (u,i) ∈ S\D, make a prediction on (u,i) and store it in Pk, i.e., Pk(u,i) = fk(u,i)
4. Output predictions made in the final iteration, PK
Output: PK

Figure 5. Overview of scalable iterative smoothing approach.

4. EXPERIMENTAL RESULTS
4.1 Overall Process
Our experiments test the stability improvements achieved by the proposed meta-algorithmic approach in conjunction with several popular collaborative filtering techniques.

The experiments follow the two-phase stability computation from prior literature [2], discussed in Section 2. We used the standard train-test data splitting approach and divided the known ratings data D into two sets: training data DT (80%) and validation data DV (20%), where D = DT ∪ DV and DT ∩ DV = ∅. Training set DT was used for building rating prediction models, while validation set DV was reserved exclusively for evaluating the predictive accuracy of the final predictions. Similarly, a randomly chosen half of the unknown rating space, ET, was dedicated to the stability evaluation of rating prediction models during the training phase, and the other half of the unknown rating space, EV, was reserved exclusively for proper evaluation of the stability of the final predictions. Here, S\D = ET ∪ EV and ET ∩ EV = ∅.

Additionally, the process of iterative smoothing involves multiple iterations to adjust the predictions of unknown ratings. One of the goals in this study is to find whether predictions converge during the process of iterative smoothing and, if so, when. In addition, there is a possibility that iterative smoothing models can "over-adjust" rating predictions after a number of iterations, in their attempt to maximize performance on the training data. Over-fitting is a well-known phenomenon which occurs when a predictive model is fine-tuned to fit the training data (including the random errors, outliers, and noise in the data) too well, which typically leads to diminished predictive performance on test data. Therefore, for a given algorithm, it is necessary to find the best number of iterations to use on a given dataset in the final iterative smoothing procedure. In order to find the optimal parameter settings, we further used the standard train-test data splitting approach internally within the training data to identify the best parameter values for the proposed approach, i.e., those that result in the best accuracy and stability. The overall experimental process is summarized in Figure 6.

Step 0: Create training and test datasets, i.e., DT, DV, ET, EV.
Step 1: Find the best model parameters:
 i. Use a portion (e.g., 75%) of training set DT to build rating prediction models using iterative smoothing.
 ii. Compute model accuracy on the other portion (25%) of DT.
 iii. Compute model stability on predictions made on ET.
 iv. Repeat steps i-iii with various parameter settings for iterative smoothing (i.e., number of iterations).
 v. Find the best parameters for iterative smoothing, i.e., the specifications that result in the best stability and accuracy.
Step 2: Evaluate accuracy and stability of the final predictions:
 i. Use the best parameter settings for iterative smoothing.
 ii. Train the system on the entire training set DT using iterative smoothing.
 iii. Evaluate predictive accuracy on the reserved validation rating set DV.
 iv. Evaluate recommendation stability on the reserved unknown rating space EV.
Step 3: Report parameter setting(s) as well as the accuracy and stability of the final predictions.

Figure 6. Overall experimental process.

4.2 Recommendation Algorithms
In our experiments, we test the proposed approach in conjunction with four popular recommendation algorithms: the simple baseline method, the classic user- and item-based variations of neighborhood-based CF approaches, and the matrix factorization technique. A brief overview of each technique is provided below.

Baseline. In real-world settings, some users may systematically tend to give higher ratings than others, and some universally liked items might receive higher ratings than others. Without normalization, such user and item effects could bias the system's predictions. Hence, recommender systems often involve a pre-processing step to remove these "global effects". One common practice is to estimate and remove three effects: the overall mean, the main effect of an item, and the main effect of a user [4].
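The three global effects can be estimated as sketched below (an illustrative sketch; the plain, shrinkage-free averages and the sequential item-then-user estimation order are our own simplifications of the estimation in [4]):

```python
def global_effects(ratings):
    """Estimate the overall mean, per-item effects, and per-user effects
    from known ratings given as a dict (user, item) -> value."""
    mu = sum(ratings.values()) / len(ratings)
    # Item effect b_i: average deviation of the item's ratings from mu.
    item_dev, user_dev = {}, {}
    for (u, i), r in ratings.items():
        item_dev.setdefault(i, []).append(r - mu)
    b_i = {i: sum(d) / len(d) for i, d in item_dev.items()}
    # User effect b_u: average residual deviation after the item effect.
    for (u, i), r in ratings.items():
        user_dev.setdefault(u, []).append(r - mu - b_i[i])
    b_u = {u: sum(d) / len(d) for u, d in user_dev.items()}

    def predict(u, i):
        # Baseline estimate: overall mean + item effect + user effect;
        # unseen users or items fall back to an effect of zero.
        return mu + b_i.get(i, 0.0) + b_u.get(u, 0.0)
    return predict
```

Subtracting these estimates from the raw ratings is exactly the normalization step applied before the other algorithms, and `predict` by itself is the Baseline method compared in the experiments.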
After the two sub-matrices are learned using rating of corresponding user and item, i.e., known ratings, each unknown rating is estimated as a dot-product bui = µ + bu + bi , of the corresponding user- and item-factors vectors. Many variations of matrix factorization techniques have been developed where µ is the overall average rating, bu is the average observed during the recent Netflix Prize competition (e.g., [14,15,18,21]). deviation from µ on ratings provided by user u, and bi is the Our experiments focus on the basic underlying version of the average observed deviation from µ on ratings given to item i. matrix factorization [8]; however, the proposed meta-algorithmic Note that, in all of our experiments (i.e., with all other approach can be applied with any variation of this technique. recommendation algorithms), the ratings data were normalized by removing these global effects. Moreover, this estimate is often 4.3 Results: Comparing Iterative Smoothing used as a baseline recommendation technique for comparison with with Standard Recommendation Techniques other recommendation algorithms, i.e., R*(u,i) = bui, and we The objective of the experiment is to compare the performance of investigate its performance in our experiments as well. the proposed iterative smoothing approach with standard single- User-Based Collaborative Filtering (CF_User). The user-based model recommendation techniques on several real world datasets. nearest-neighbor collaborative filtering approach is a heuristic that The first dataset we used is the Movielens 100K dataset [10], makes predictions of unknown ratings for a user based on the which contains 100,000 known ratings on 1682 movies from 943 ratings previously rated by this user’s “nearest neighbors”, i.e., users (6.3% data density). Our second dataset is a sample other users who have similar rating patterns [6,20]. That is, the extracted from the Movielens 1M dataset. 
The original Movielens value of the unknown rating for user u and item i is usually 1M dataset consists of 1,000,000 ratings for 6040 movies by 3952 computed as an aggregate of the neighbors’ ratings for the same users (4.2% data density) [10]. From this dataset we extracted a item i. The most common aggregation approach is the weighted random sample of 3000 users and 3000 movies. Resulted dataset sum of the neighbors’ ratings, where the similarity of two users is contains 400,627 known ratings (i.e., 4.45% data density). Our used as a weight. I.e., the more similar user u' and target user u third dataset is sampled from the Netflix 100M dataset used in the are, the more weight will be carried by the rating provided by user recent Netflix Prize competition [5]. Similarly to the second u' on item i in the weighted sum when computing the prediction. dataset, we sub-sampled 3000 random users and 3000 random Predicted rating for user u on item i is computed as: movies from the original data file. The result data sample consists ∗ ∑ ∈ , ∗ , of 105,256 known ratings (i.e., 1.17% data density). The three , datasets used in our experiments come from different sources and ∑ ∈ , | | have different data characteristics (i.e., size and sparsity). All where N(u,i) is a set of “neighbors” with similar rating patterns to movie ratings in the Movielens and Netflix datasets are integer user u and that have provided ratings for item i, simuv is the values between 1 and 5, where 1 represents the least liked movies, similarity between users u and v, and bui is the baseline estimate and 5 represents the most liked movies. The datasets used in this for user u on item i. In our implementation, two users must have experiment are summarized in Table 1. rated at least 3 items in common to allow computation of similarity between them. The similarity between two users is Table 1. Summary of Experimental Datasets. 
calculated as Pearson correlation between rating vectors (based on DataSet Description Users Items Density the commonly rated items) of the two users. Prediction of each Movielens Movie ratings from 943 1682 6.30% unknown rating is formulated by combining the preferences of 20 100K Movielens movie most similar users who have rated the same item. Movielens recommender system. 3000 3000 4.45% 1M Item-Based Collaborative Filtering (CF_Item). The user-based Movie ratings collaborative filtering technique also has an analogous item-based Netflix distributed by Netflix 3000 3000 1.17% version, where the ratings of the nearest-neighbor items are used company. to predict unknown ratings for a given item. Several studies have presented empirical evidence that item-based algorithms often The procedure of this experiment followed the general provide better predictive accuracy than user-based methods (e.g., experimental process described in Figure 6. We examined the [22]). Thus, our experiments also test the standard item-based prediction accuracy and stability of the final predictions on the collaborative filtering in conjunction with the proposed approach. reserved validation datasets (as described in Figure 6, Step 2). Similarly to the settings employed in user-based CF, in our For each recommendation algorithm used in our study, we experiments, two items are required to have been rated by 3 compare two approaches: the standard (original) single-model common users to allow similarity evaluation between them, and approach and the scalable version of iterative smoothing 20 nearest-neighbor items are used to formulate a prediction. approach. Accuracy and stability numbers (measured by RMSE and RMSS) of the two approaches on real-world movie rating Matrix factorization (SVD). Matrix factorization technique is a datasets are provided in Table 2. 
Matrix Factorization (SVD). The matrix factorization technique is a model-based (as opposed to heuristic-based) collaborative filtering approach that characterizes items and users via a number of latent factors inferred from known ratings [8,15]. This technique models the U×I rating space as a product of two sub-matrices: a user preference matrix (U×L) and an item feature matrix (L×I). Each user and item is described by a vector of L latent variables; in our experiments L is set to 20. The user vector indicates the preference of the user for the latent features, and the item vector represents the item's importance weights for the same latent features. Singular value decomposition (SVD) techniques are used to decompose the original rating matrix into the two sub-matrices in an optimal way that minimizes the resulting approximation error.

Experimental results are consistent across the different datasets. On all three datasets, our proposed meta-algorithmic iterative smoothing approach outperformed the original recommendation techniques in both stability and accuracy in the vast majority of cases. In particular, on average (i.e., across all datasets), iterative smoothing provided a dramatic 55% improvement over the original recommendation algorithms in stability (as measured by RMSS) for the CF_User and CF_Item algorithms. Even for the fairly stable SVD and baseline techniques, on average, iterative smoothing was able to further improve RMSS by 14%. In terms of predictive accuracy, on average, iterative smoothing provided 1.4% improvements in RMSE across the different algorithms.

Table 2. Iterative Smoothing vs. Standard Techniques.

Dataset        | Method   | Approach  | Accuracy (RMSE) | Stability (RMSS)
Movielens 100K | SVD      | Standard  | 0.9437          | 0.0866
               | SVD      | Smoothing | 0.9378          | 0.0779
               | CF_User  | Standard  | 0.9684          | 0.4023
               | CF_User  | Smoothing | 0.9586          | 0.2020
               | CF_Item  | Standard  | 0.9560          | 0.3234
               | CF_Item  | Smoothing | 0.9320          | 0.1347
               | Baseline | Standard  | 0.9758          | 0.1148
               | Baseline | Smoothing | 0.9609          | 0.0569
Movielens 1M   | SVD      | Standard  | 0.8846          | 0.0804
               | SVD      | Smoothing | 0.8798          | 0.0719
               | CF_User  | Standard  | 0.9393          | 0.3294
               | CF_User  | Smoothing | 0.9182          | 0.1145
               | CF_Item  | Standard  | 0.9135          | 0.2755
               | CF_Item  | Smoothing | 0.8929          | 0.1089
               | Baseline | Standard  | 0.9425          | 0.0910
               | Baseline | Smoothing | 0.9346          | 0.0923
Netflix        | SVD      | Standard  | 0.9372          | 0.1208
               | SVD      | Smoothing | 0.9363          | 0.1137
               | CF_User  | Standard  | 0.9608          | 0.4610
               | CF_User  | Smoothing | 0.9407          | 0.2462
               | CF_Item  | Standard  | 0.9579          | 0.4394
               | CF_Item  | Smoothing | 0.9381          | 0.2291
               | Baseline | Standard  | 0.9622          | 0.1465
               | Baseline | Smoothing | 0.9556          | 0.1351

5. CONCLUSIONS AND FUTURE WORK
This paper introduces a general-purpose, practical meta-algorithmic approach, based on iterative smoothing, for improving the stability of a variety of traditional recommendation algorithms. The iterative smoothing approach uses multiple iterations to repeatedly and explicitly adjust the predictions of a recommendation algorithm based on its other predictions, in order to make them more consistent with each other. We examined the performance of the iterative smoothing approach on several real-world datasets. Our experiments show that the proposed approach is effective at improving stability for several widely used recommendation algorithms. Perhaps as importantly, the proposed approach not only does not sacrifice predictive accuracy to obtain these stability improvements, but is actually able to provide some additional accuracy improvements at the same time.

This work suggests several interesting directions for future research. This study shows that iterative smoothing can improve stability for different recommendation algorithms, providing larger improvements for some algorithms and smaller improvements for others. Developing additional theoretical understanding of which algorithmic and data characteristics lead to larger vs. smaller improvements in recommendation stability for the proposed approach is an important direction for future work. Another interesting direction would be to perform user behavior studies to investigate the impact of stable (as opposed to unstable) recommendations on users' usage patterns and acceptance of recommender systems.

ACKNOWLEDGMENTS
This research was supported in part by the National Science Foundation grant IIS-0546443.

REFERENCES
[1] Adomavicius, G., and Tuzhilin, A. 2005. "Toward the Next Generation of Recommendation Systems: A Survey of the State-of-the-Art and Possible Extensions," IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749.
[2] Adomavicius, G., and Zhang, J. 2010. "On the Stability of Recommendation Algorithms," ACM Conference on Recommender Systems, Barcelona, Spain: ACM, 47-54.
[3] Balabanovic, M., and Shoham, Y. 1997. "Fab: Content-Based, Collaborative Recommendation," Comm. of ACM, 40(3), 66-72.
[4] Bell, R.M., and Koren, Y. 2007. "Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights," Seventh IEEE International Conference on Data Mining, Omaha, NE, USA.
[5] Bennett, J., and Lanning, S. 2007. "The Netflix Prize," KDD-Cup and Workshop, San Jose, CA.
[6] Breese, J.S., Heckerman, D., and Kadie, C. 1998. "Empirical Analysis of Predictive Algorithms for Collaborative Filtering," 14th Conf. on Uncertainty in Artificial Intelligence, Madison, WI.
[7] D'astous, A., and Touil, N. 1999. "Consumer Evaluations of Movies on the Basis of Critics' Judgments," Psychology & Marketing, 16, 677-694.
[8] Funk, S. 2006. "Netflix Update: Try This at Home."
[9] Gershoff, A., Mukherjee, A., and Mukhopadhyay, A. 2003. "Consumer Acceptance of Online Agent Advice: Extremity and Positivity Effects," J. of Consumer Psychology, 13(1&2), 161-170.
[10] Grouplens 2011. "Movielens Data Sets."
[11] Herlocker, J.L., Konstan, J.A., Terveen, K., and Riedl, J.T. 2004. "Evaluating Collaborative Filtering Recommender Systems," ACM Transactions on Information Systems, 22(1), 5-53.
[12] Komiak, S., and Benbasat, I. 2006. "The Effects of Personalization and Familiarity on Trust and Adoption of Recommendation Agents," MIS Quarterly, 30(4), 941-960.
[13] Koren, Y. 2009. "The Bellkor Solution to the Netflix Grand Prize." From http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf.
[14] Koren, Y. 2008. "Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model," ACM Intl. Conf. on Knowledge Discovery and Data Mining (KDD'08), 426-434.
[15] Koren, Y., Bell, R., and Volinsky, C. 2009. "Matrix Factorization Techniques for Recommender Systems," IEEE Computer, 42, 30-37.
[16] Macskassy, S.A., and Provost, F. 2007. "Classification in Networked Data: A Toolkit and a Univariate Case Study," Journal of Machine Learning Research, 8, 935-983.
[17] O'Donovan, J., and Smyth, B. 2005. "Trust in Recommender Systems," 10th Intl. Conf. on Intelligent User Interfaces.
[18] Paterek, A. 2007. "Improving Regularized Singular Value Decomposition for Collaborative Filtering," KDDCup, 39-42.
[19] Pilaszy, I., and Tikk, D. 2009. "Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata," Third ACM Conf. on Recommender Systems (RecSys '09), 93-100.
[20] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. 1994. "Grouplens: An Open Architecture for Collaborative Filtering of Netnews," Conf. on Computer Supported Cooperative Work, 175-186.
[21] Salakhutdinov, R., and Mnih, A. 2008. "Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo," 25th Intl. Conf. on Machine Learning, Helsinki, Finland: ACM, 880-887.
[22] Sarwar, B., Karypis, G., Konstan, J.A., and Riedl, J. 2001. "Item-Based Collaborative Filtering Recommendation Algorithms," 10th International WWW Conference, Hong Kong, 285-295.
[23] Shani, G., and Gunawardana, A. 2011. "Evaluating Recommendation Systems," in Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira and P.B. Kantor (eds.), 257-294.
[24] Van Swol, L.M., and Sniezek, J.A. 2005. "Factors Affecting the Acceptance of Expert Advice," British Journal of Social Psychology, 44(3), 443-461.
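As a closing illustration of the meta-algorithmic structure summarized in the conclusions (repeatedly adjusting a base algorithm's predictions using its other predictions until they become mutually consistent), the loop can be sketched roughly as follows. This is our own simplified rendering, not the authors' exact procedure: it assumes a fixed number of iterations, smooths all unknown cells at once, and uses `fit` as a stand-in for any base recommendation algorithm (CF_User, CF_Item, SVD, or the baseline).

```python
def iterative_smoothing(fit, known, unknown_cells, iters=5):
    """Meta-algorithm sketch: repeatedly re-estimate unknown ratings with a model
    trained on the known ratings plus the current estimates of the unknown ones."""
    model = fit(known)                           # initial model: known ratings only
    preds = {c: model(c) for c in unknown_cells}
    for _ in range(iters):
        augmented = {**known, **preds}           # treat current estimates as pseudo-ratings
        model = fit(augmented)                   # re-train the base algorithm
        preds = {c: model(c) for c in unknown_cells}  # update the estimates
    return preds

def mean_fit(data):
    """Trivial base algorithm for demonstration: always predict the global mean rating."""
    m = sum(data.values()) / len(data)
    return lambda cell: m
```

Here `fit` takes a dict mapping (user, item) cells to ratings and returns a predictor; in practice the number of iterations, or a convergence check on how much the estimates change, would control the loop.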