Iterative Smoothing Technique for Improving Stability of Recommender Systems

Gediminas Adomavicius, University of Minnesota, gedas@umn.edu
Jingjing Zhang, Indiana University, jjzhang@indiana.edu

ABSTRACT
We focus on the measure of recommendation stability, which reflects the consistency of recommender system predictions. Stability is a desired property of recommendation algorithms and has important implications on users' trust and acceptance of recommendations. Prior research has reported that some popular recommendation algorithms can suffer from a high degree of instability. In this study we propose a scalable, general-purpose iterative smoothing approach that can be used in conjunction with different traditional recommendation algorithms to improve their stability. Our experimental results on real-world rating data demonstrate that the proposed approach can achieve substantially higher stability as compared to the original recommendation algorithms. Importantly, the proposed approach not only does not sacrifice predictive accuracy in order to improve recommendation stability, but is actually able to provide additional accuracy improvements at the same time.

Copyright is held by the author/owner(s). Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with ACM RecSys 2012. September 9, 2012, Dublin, Ireland.

1. INTRODUCTION
Recommender systems represent technologies that assist users in finding a set of interesting or relevant items [1]. In order to provide good recommendations, recommender systems employ users' feedback on consumed items. This input can include explicitly provided feedback in the form of ratings or tags, as well as feedback that can be implicitly inferred by monitoring users' behavior, such as browsing, linking, or buying patterns. The most common approach to modeling users' preferences for items is via numeric ratings. The recommendation algorithm then analyzes patterns of users' past ratings and predicts users' preference ratings for new, not yet consumed items. Once ratings for the new items are estimated, the item(s) with the highest estimated rating(s) can be recommended to the user.

In the recommender systems literature, evaluating the performance of recommendation algorithms has always been a key issue, and recommendation accuracy has been the major focus in developing evaluation metrics [11,23]. As a result, much of the research in the recommender systems area has focused on proposing new techniques to enhance the accuracy of recommendation algorithms in predicting what users will like, as exemplified by the recent $1M Netflix Prize competition. Prediction accuracy metrics typically compare the rating values estimated by a recommendation algorithm against the actual rating values and reflect the closeness of the system's predictions to users' true ratings. In addition to recommendation accuracy, researchers have proposed a number of alternative types of measures, including recommendation coverage, diversity, novelty, serendipity, and several others, to evaluate the performance of recommender systems [11,23]. Of special interest to us is the recently introduced measure of recommendation stability [2], which reflects the level of consistency among the predictions made by the system.

According to the definition, stability is the consistent agreement of predictions made on the same items by the same algorithm, when any new incoming ratings are in complete agreement with the system's prior estimations [2]. As has been discussed in prior work, stability is an important and desired property of recommender systems, and has a number of potential implications related to users' trust and acceptance of such systems [2].

While providing stable and consistent recommendations is important in many contexts, prior research has demonstrated that some popular collaborative filtering recommendation algorithms can suffer from a high degree of instability [2]. This is particularly true for the widely used item- and user-based nearest-neighbor collaborative filtering approaches. It has also been shown that stability does not necessarily correlate with predictive accuracy [2], i.e., different recommendation algorithms can exhibit different levels of stability, even though they may have similar prediction accuracy. Thus, maximizing accuracy may not necessarily help to improve stability, and vice versa. For instance, a simple heuristic that predicts any unknown user rating as the average of all known ratings of that user is perfectly stable [2]; however, in most real-world settings this heuristic is outperformed by more sophisticated recommendation algorithms in terms of predictive accuracy. Therefore, the main objective of this study is to develop an approach that can improve the stability of recommendation algorithms without sacrificing their accuracy.

In this paper, we propose a general iterative smoothing approach to improve the stability of any given recommendation technique. The approach serves as a meta-algorithm, i.e., it can be used in conjunction with any traditional recommendation technique. Accordingly, the paper evaluates the performance of the proposed approach in conjunction with a number of popular and widely-used recommendation algorithms in terms of their stability as well as accuracy on several real-world movie rating datasets. The results show that this meta-algorithmic approach provides substantial improvements in recommendation stability as compared to the original recommendation algorithms, while providing some additional accuracy benefits as well.

2. RELATED WORK
Based on how unknown ratings are predicted, recommendation techniques can be classified into three general categories: content-based, collaborative filtering, and hybrid [1,3]. Among different recommendation approaches, collaborative filtering techniques have been most widely used, largely because they are domain independent, require minimal, if any, information about user and item features, yet can still achieve accurate predictions [13,19].

In a typical setting of collaborative filtering recommender systems, users' preferences for items are modeled via numeric ratings. Thus, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user, and this estimation is usually based on the other available ratings given by this and/or other users. More formally, given a set of users U and a set of items I, the entire user-item space is denoted as S = U × I. Let R(u,i) represent the rating that user u gave to item i, where R(u,i) is typically known only for a limited subset of all possible (u,i) pairs. Let D be the set of known ratings, and S\D be the set of unknown ratings. Therefore, the recommendation task is to estimate the unknown R(u,i) values for all (u,i) ∈ S\D pairs, given U, I, and the known R(u,i) values for (u,i) ∈ D.

As mentioned earlier, predictive accuracy has been a major focus in the recommender systems literature. One of the most widely used predictive accuracy metrics for recommender systems is root mean squared error (RMSE), which we will use in this paper to report the recommendation accuracy results.

Figure 1. Illustration of stability computation, adapted from [2].
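To make the preceding notation concrete, the known-ratings set D and the RMSE metric can be sketched in Python as follows (an illustrative sketch; the dictionary representation, the toy ratings, and the global-mean predictor are our own choices, not from the paper):

```python
import math

# Known ratings D: a map from (user, item) pairs to rating values.
# The full rating space S = U x I; pairs absent from D form S\D.
D = {("u1", "i1"): 4.0, ("u1", "i2"): 3.0, ("u2", "i1"): 5.0}

def rmse(predictions, actuals):
    """Root mean squared error over the (u,i) pairs present in both maps."""
    common = predictions.keys() & actuals.keys()
    sq = [(predictions[p] - actuals[p]) ** 2 for p in common]
    return math.sqrt(sum(sq) / len(sq))

# Example: evaluate a predictor that guesses the mean of the known ratings.
mean = sum(D.values()) / len(D)
preds = {pair: mean for pair in D}
print(round(rmse(preds, D), 4))  # prints 0.8165
```

Any of the recommendation algorithms discussed later would play the role of the mean predictor here, producing estimates for the unknown pairs in S\D instead.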
However, there is a growing understanding that good recommendation accuracy alone may not give users a satisfying experience when using recommender systems, and that some other (complementary) measures are also important in evaluating the effectiveness of the system [23]. Our focus in this paper is one of these important complementary measures, recommendation stability [2].

Recommendation stability measures the inherent consistency among the different predictions made by the system. Consider an example where the system makes predictions for two movies, i1 and i2, for user u. Let's denote the two rating predictions as R*(u,i1) and R*(u,i2). Let's assume that prediction R*(u,i1) is precisely accurate and, after user u consumes item i1, it gets added to the system as part of the known ratings, i.e., R(u,i1) = R*(u,i1). The recommendation algorithm re-computes all other predictions in light of the new data. Would this change the value of the other prediction, R*(u,i2), and to what extent, even though the newly incoming rating data was exactly as predicted by the system? In other words, two predictions R*(u,i1) and R*(u,i2) of the same recommender system can be viewed as "inconsistent" with each other if adding one of them to the training data for the system changes the other prediction. As discussed in [2], the degree of the change in predictions reflects the (in)stability of the recommendation algorithm.

Specifically, the stability of a recommender system is defined as the extent to which the system's predictions for the same items stay the same/similar (i.e., are stable) when any and all new incoming ratings submitted to the system are in complete agreement with the system's prior predictions. Based on this idea, a two-phase approach (illustrated in Figure 1) for computing the stability of a recommendation algorithm has been introduced in prior literature [2]. In Phase 1, given a set of known ratings, a predictive model is built, and predictions for all unknown ratings are made. Then, a random subset of the system's predictions is added to the original set of known ratings as hypothetical new incoming ratings. In Phase 2, based on the expanded training data, a second predictive model is built using the same recommendation technique, and predictions on the remaining unknown ratings are made. Stability is then measured by comparing the two sets of predictions to compute their root mean squared difference, which is called root mean squared shift (RMSS) and is computed in a similar fashion as RMSE:

RMSS = sqrt( (1 / |P1 ∩ P2|) · Σ(u,i)∈P1∩P2 (P2(u,i) − P1(u,i))² ),

where P1 and P2 are the predictions made in Phase 1 and Phase 2, respectively, and the intersection is taken over the user-item pairs predicted in both phases. That is, RMSS captures the shift in predictions made by the same recommendation algorithm after receiving new ratings that are in complete agreement with the algorithm's own prior estimations.

Providing consistent predictions has important implications on users' trust and acceptance of recommendations. In the consumer psychology literature, it has been shown that online advice-giving agents with greater fluctuations in past opinions are considered less informative than those with a more uniform pattern of opinions [9], and advice inconsistent with past recommendations is considered less helpful than consistent advice [7,24]. Inconsistent recommendations may be discredited by users and, as a result, the decrease in users' trust will further reduce their perceptions of the recommender system's competence [12,17]. In contexts where consistency is vital, the instability of a recommender system is likely to have a negative impact on users' acceptance and, thus, harm the success of the system.

Prior research has provided a comprehensive investigation into the stability of popular recommendation algorithms [2] and showed that some widely used algorithms can be highly unstable. For example, it has been shown that RMSS for the user-based collaborative filtering approach can be as high as 0.47 on the Netflix movie rating data, meaning that on average every predicted rating will shift by 0.47 stars (i.e., by about a half-star) after adding to the training data some new ratings that are identical to the system's current predictions [2]. This is a very significant shift in prediction, considering that the length of the rating scale for the dataset is only 4 stars (i.e., ratings go from 1 to 5). Thus, techniques for stability improvement are needed. In this paper, we propose a general-purpose meta-algorithmic approach that can be used to improve the stability of traditional recommendation algorithms.

3. ITERATIVE SMOOTHING APPROACH
3.1 General Idea
High instability results from predictions that are inconsistent with each other. We propose an iterative smoothing approach, which involves multiple iterations for repeatedly and collectively adjusting the rating predictions of a recommendation algorithm based on its other predictions and, thus, is explicitly aimed at improving the consistency of predicted ratings. The key idea of iterative smoothing is that the rating predictions computed during the current iteration will be fed back into the data to predict other ratings in subsequent iterations.

Figure 2 provides an overview of the proposed iterative smoothing algorithm, and Figure 3 gives a high-level illustration of the overall process. Given rating space S and training set D where ratings are known, predictions on the unknown ratings S\D are first estimated using some standard recommendation algorithm T. These predictions are denoted as P0.
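The two-phase stability computation described above can be sketched as follows (an illustrative sketch; the overall-mean predictor standing in for an arbitrary algorithm T, and the toy data layout, are our own assumptions):

```python
import math
import random

def train_and_predict(known, unknown_pairs):
    """Stand-in for an arbitrary recommendation algorithm T: predicts
    every unknown rating as the overall mean of the known ratings."""
    mean = sum(known.values()) / len(known)
    return {pair: mean for pair in unknown_pairs}

def rmss(known, unknown_pairs, added_fraction=0.5, seed=0):
    """Two-phase stability computation: root mean squared shift (RMSS)."""
    # Phase 1: train on the known ratings, predict all unknown ratings.
    p1 = train_and_predict(known, unknown_pairs)
    # Add a random subset of the predictions as hypothetical new ratings.
    rng = random.Random(seed)
    added = rng.sample(sorted(p1), int(len(p1) * added_fraction))
    expanded = dict(known)
    expanded.update((pair, p1[pair]) for pair in added)
    remaining = [pair for pair in unknown_pairs if pair not in expanded]
    # Phase 2: retrain on the expanded data, re-predict the remaining pairs.
    p2 = train_and_predict(expanded, remaining)
    shifts = [(p2[pair] - p1[pair]) ** 2 for pair in remaining]
    return math.sqrt(sum(shifts) / len(shifts))
```

For the mean-based stand-in, the shift is zero by construction, which mirrors the observation above that simple averaging heuristics are perfectly stable; plugging a less stable algorithm into `train_and_predict` yields a positive RMSS.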
The procedure of iterative smoothing then starts to iteratively adjust the estimations for each rating in S\D based on all other ratings in the rating space S (i.e., both known as well as predicted), in order to proactively improve the consistency between different predicted ratings.

Figure 3. Illustration of the general iterative smoothing process.

More specifically, during the k-th iteration (for any k = 1, 2, …), for each unknown user-item pair (u,i) ∈ S\D, a model f(k,u,i) is built based on training dataset D(k,u,i). This dataset is constructed to contain all known ratings combined with all predictions on user-item pairs in S\D computed in the previous (k−1) iteration, except for the prediction on (u,i) itself, as indicated in Figure 2. As the result of the k-th iteration, each model f(k,u,i) produces a new prediction for (u,i) that is stored as Pk(u,i), i.e.:

Pk(u,i) = R(u,i), if (u,i) ∈ D;
Pk(u,i) = f(k,u,i)(u,i), if (u,i) ∈ S\D.

Here R(u,i) represents the known rating that user u gave item i, and f(k,u,i) is the prediction model built based on D(k,u,i). Figure 4 illustrates the process within each iteration of the iterative smoothing approach. The set of predictions made in the current iteration is compared with the predictions made in the previous iteration to compute the deviation between the two sets (measured as a root mean squared difference). The iterative smoothing process stops either after a fixed, pre-determined number of iterations or when the predictions on unknown ratings do not change.

Figure 4. Illustration of one smoothing iteration.

By definition, in each iteration, a separate model is built for every user-item pair with an unknown rating. Thus, in total, |S\D|·K models need to be constructed over the course of K iterations, each using |S|−1 ratings as a training dataset. If t(x) is the time needed to build a predictive model on a data sample of size x using recommendation algorithm T, then the time complexity of the iterative smoothing approach is O(|S\D|·K·t(|S|)).

Iterative Smoothing Algorithm:
Inputs: known ratings data D, # of iterations K, algorithm T
Process:
1. Build model f0 on known ratings D using some standard recommendation algorithm T, i.e., f0 ← T(D)
2. Apply model f0 to compute predictions P0 for unknown ratings S\D, i.e., P0(u,i) = f0(u,i) for (u,i) ∈ S\D
3. For each iteration k ∈ {1, …, K}
   For each unknown rating pair (u,i) ∈ S\D
   a. Construct dataset D(k,u,i) by including all known ratings D and all predicted ratings Pk-1 from the previous iteration, except for rating Pk-1(u,i), i.e., D(k,u,i) = D ∪ Pk-1 \ {Pk-1(u,i)}
   b. Build model f(k,u,i) on dataset D(k,u,i) using T, i.e., f(k,u,i) ← T(D(k,u,i))
   c. Make a prediction on (u,i) and store it in Pk, i.e., Pk(u,i) = f(k,u,i)(u,i)
4. Output predictions made in the final iteration, PK
Output: PK

Figure 2. Iterative smoothing approach.

3.2 Scalable Iterative Smoothing
We propose a variation of the iterative smoothing approach that substantially simplifies the original approach and significantly reduces its computational requirements. Figure 5 provides an overview of the proposed scalable iterative smoothing algorithm. Similarly to the original iterative smoothing approach, the proposed scalable version also involves multiple iterations for repeatedly adjusting the estimations for each unknown rating. However, in each iteration, instead of building one model for each unknown rating, only one single model is built upon all known ratings and the predicted ratings from the previous iteration.
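For concreteness, the original per-rating procedure of Figure 2 can be sketched as follows (an illustrative sketch; `T` is any trainable regressor, here stubbed with a mean predictor of our own choosing):

```python
def iterative_smoothing(D, unknown_pairs, T, K):
    """Original (per-rating) iterative smoothing, as in Figure 2.

    D: dict mapping (user, item) -> known rating.
    unknown_pairs: the pairs in S\\D whose ratings must be estimated.
    T: function mapping a ratings dict to a model; a model maps a
       (user, item) pair to a predicted rating.
    K: number of smoothing iterations.
    """
    f0 = T(D)                                # step 1: initial model
    P = {p: f0(p) for p in unknown_pairs}    # step 2: predictions P0
    for k in range(1, K + 1):                # step 3: smoothing iterations
        P_new = {}
        for pair in unknown_pairs:
            # D_{k,u,i}: all known ratings plus every previous-iteration
            # prediction, except the prediction for this pair.
            D_kui = {**D, **P}
            del D_kui[pair]
            f_kui = T(D_kui)                 # one model per unknown rating
            P_new[pair] = f_kui(pair)
        P = P_new                            # predictions P_k
    return P                                 # step 4: output P_K

# A stand-in algorithm T: predict the mean of the training ratings.
def mean_T(ratings):
    mean = sum(ratings.values()) / len(ratings)
    return lambda pair: mean
```

The scalable variant of Figure 5 replaces the inner loop with a single model per iteration, built on D ∪ Pk-1, reducing K·|S\D| model builds to K.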
In other words, the complexity of the original iterative smoothing algorithm is proportional to the number of unknown ratings in the rating space. This computational requirement is likely to be prohibitively expensive for many real-world scenarios, i.e., anytime the user-item rating space is large and the rating data is sparse. For example, the Movielens 100K dataset is a publicly available movie rating dataset [10], which is considered to be relatively small. This dataset contains 100,000 movie ratings from 943 users on 1682 movies. Thus, the total number of possible ratings is about 1.6M (i.e., 1682 × 943), of which 1.5M are unknown. If we applied the original iterative smoothing algorithm to the Movielens 100K dataset, in total 1.5 million predictive models (each built on 1.6 million ratings) would need to be constructed within each iteration. Therefore, applying the full iterative smoothing approach may not be feasible even for datasets that are not very large. The scalable variation described in this section overcomes this complexity issue and offers significant scalability improvements while retaining most of the stability benefits (as will be shown in the Results section).

Specifically, during the k-th iteration (for any k = 1, 2, …), a single model fk is built based on training dataset Dk, which contains all known ratings D combined with all predictions Pk-1 on user-item pairs in S\D computed in the previous iteration. Pk(u,i) denotes the predicted rating for (u,i) in the k-th iteration by this collective inference approach, defined as follows:

Pk(u,i) = R(u,i), if (u,i) ∈ D;
Pk(u,i) = fk(u,i), if (u,i) ∈ S\D.

Here R(u,i) represents the known rating that user u gave item i. As the result of the k-th iteration, for each (u,i) ∈ S\D, the model fk produces a new prediction for (u,i) that is stored as Pk(u,i). Similarly to the original approach, the new predictions are compared with the predictions made in the previous iteration, and the procedure ends either after a fixed number of iterations or when the predictions on unknown ratings converge (i.e., do not change).

The key difference between this simplified variation and the original iterative smoothing algorithm is the number of predictive models built within each iteration k. The original algorithm builds a separate model for every unknown rating in the rating space, in order to properly adjust the predicted rating Pk(u,i) using all other ratings from the previous iteration, i.e., all ratings from D as well as all ratings Pk-1(u',i') where (u',i') ≠ (u,i). In contrast, the simplified algorithm builds only one predictive model in each iteration, based on the entire rating matrix. In other words, predicted rating Pk(u,i) is adjusted using all ratings from the previous iteration, i.e., all ratings from D as well as from Pk-1, including Pk-1(u,i). Thus, in the simplified algorithm, for any given rating prediction P1(u,i) in the first iteration, the predictive model is built on rating data (i.e., D1) that differs from the rating data used in the original algorithm by only one additional rating (i.e., D1\{P0(u,i)}). Because the influence of one additional rating is often subtle, especially when the entire rating space is large (i.e., in settings with large numbers of users and items), the single overall model built in the simplified algorithm should produce outcomes similar to the ones produced by the individual models built in the original algorithm, especially in the first iteration. While the difference between the original and simplified versions of iterative smoothing may slowly increase as the number of iterations grows, the simplified approach still provides significant performance improvements (both in stability and accuracy), as demonstrated by the experimental results later in the paper.

Moreover, the runtime complexity of the simplified algorithm is much lower, making it much more practical from the scalability perspective. In particular, as only one overall model is built on all available ratings (i.e., a dataset of size |S|) within each iteration, in total K models are constructed over the course of K iterations. Thus, the time complexity of the simplified variation is O(K·t(|S|)). Comparing this to the complexity of the original algorithm, O(|S\D|·K·t(|S|)), the scalable heuristic offers huge computational improvements (i.e., by roughly a factor of |S\D|). In this paper, we use scalable iterative smoothing in our experiments.

Scalable Iterative Smoothing Algorithm:
Inputs: known ratings data D, # of iterations K, algorithm T
Process:
1. Build model f0 on known ratings D using some standard recommendation algorithm T, i.e., f0 ← T(D)
2. Apply model f0 to compute predictions P0 for unknown ratings S\D, i.e., P0(u,i) = f0(u,i) for (u,i) ∈ S\D
3. For each iteration k ∈ {1, …, K}
   a. Construct dataset Dk by including all known ratings D and all predicted ratings Pk-1 from the previous iteration, i.e., Dk = D ∪ Pk-1
   b. Build model fk on dataset Dk using T, i.e., fk ← T(Dk)
   c. For each unknown rating pair (u,i) ∈ S\D, make a prediction on (u,i) and store it in Pk, i.e., Pk(u,i) = fk(u,i)
4. Output predictions made in the final iteration, PK
Output: PK

Figure 5. Overview of scalable iterative smoothing approach.

4. EXPERIMENTAL RESULTS
4.1 Overall Process
Our experiments test the stability improvements achieved by the proposed meta-algorithmic approach in conjunction with several popular collaborative filtering techniques.

The experiments follow the two-phase stability computation from prior literature [2], discussed in Section 2. We used the standard train-test data splitting approach and divided the known ratings data D into two sets: training data DT (80%) and validation data DV (20%), where D = DT ∪ DV and DT ∩ DV = ∅. Training set DT was used for building rating prediction models, while validation set DV was reserved exclusively for evaluating the predictive accuracy of the final predictions. Similarly, a randomly chosen half of the unknown rating space, ET, was dedicated to the stability evaluation of rating prediction models during the training phase, and the other half of the unknown rating space, EV, was reserved exclusively for proper evaluation of the stability of the final predictions. Here, S\D = ET ∪ EV and ET ∩ EV = ∅.

Additionally, the process of iterative smoothing involves multiple iterations to adjust the predictions of unknown ratings. One of the goals in this study is to find whether predictions converge during the process of iterative smoothing and, if so, when. In addition, there is a possibility that iterative smoothing models can "over-adjust" rating predictions after a number of iterations, in their attempt to maximize performance on the training data. Over-fitting is a well-known phenomenon which occurs when a predictive model is fine-tuned to fit the training data (including the random errors, outliers, and noise in the data) too well, which typically leads to diminished predictive performance on test data. Therefore, for a given algorithm, it is necessary to find the best number of iterations to use on a given dataset in the final iterative smoothing procedure. In order to find the optimal parameter settings, we further used the standard train-test data splitting approach internally within the training data to identify the best parameter values for the proposed approach, i.e., those that result in the best accuracy and stability. The overall experimental process is summarized in Figure 6.

Step 0: Create training and test datasets, i.e., DT, DV, ET, EV.
Step 1: Find the best model parameters:
 i. Use a portion (e.g., 75%) of training set DT to build rating prediction models using iterative smoothing.
 ii. Compute model accuracy on the other portion (25%) of DT.
 iii. Compute model stability on predictions made on ET.
 iv. Repeat steps i-iii with various parameter settings for iterative smoothing (i.e., number of iterations).
 v. Find the best parameters for iterative smoothing, i.e., the specifications that result in the best stability and accuracy.
Step 2: Evaluate accuracy and stability of the final predictions:
 i. Use the best parameter settings for iterative smoothing.
 ii. Train the system on the entire training set DT using iterative smoothing.
 iii. Evaluate predictive accuracy on the reserved validation rating set DV.
 iv. Evaluate recommendation stability on the reserved unknown rating space EV.
Step 3: Report parameter setting(s) as well as the accuracy and stability of the final predictions.

Figure 6. Overall experimental process.

4.2 Recommendation Algorithms
In our experiments, we test the proposed approach in conjunction with four popular recommendation algorithms: the simple baseline method, the classic user- and item-based variations of neighborhood-based CF approaches, and the matrix factorization technique. A brief overview of each technique is provided below.

Baseline. In real-world settings, some users may systematically tend to give higher ratings than others, and some universally liked items might receive higher ratings than others. Without normalization, such user and item effects could bias the system's predictions. Hence, recommender systems often involve a pre-processing step to remove these "global effects". One common practice is to estimate and remove three effects: the overall mean, the main effect of an item, and the main effect of a user [4].
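The three global effects can be estimated as sketched below (an illustrative sketch; the plain, shrinkage-free averages and the sequential item-then-user estimation order are our own simplifications of the estimation in [4]):

```python
def global_effects(ratings):
    """Estimate the overall mean, per-item effects, and per-user effects
    from known ratings given as a dict (user, item) -> value."""
    mu = sum(ratings.values()) / len(ratings)
    # Item effect b_i: average deviation of the item's ratings from mu.
    item_dev, user_dev = {}, {}
    for (u, i), r in ratings.items():
        item_dev.setdefault(i, []).append(r - mu)
    b_i = {i: sum(d) / len(d) for i, d in item_dev.items()}
    # User effect b_u: average residual deviation after the item effect.
    for (u, i), r in ratings.items():
        user_dev.setdefault(u, []).append(r - mu - b_i[i])
    b_u = {u: sum(d) / len(d) for u, d in user_dev.items()}

    def predict(u, i):
        # Baseline estimate: overall mean + item effect + user effect;
        # unseen users or items fall back to an effect of zero.
        return mu + b_i.get(i, 0.0) + b_u.get(u, 0.0)
    return predict
```

Subtracting these estimates from the raw ratings is exactly the normalization step applied before the other algorithms, and `predict` by itself is the Baseline method compared in the experiments.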
After the two sub-matrices are learned using rating of corresponding user and item, i.e., known ratings, each unknown rating is estimated as a dot-product bui = µ + bu + bi , of the corresponding user- and item-factors vectors. Many variations of matrix factorization techniques have been developed where µ is the overall average rating, bu is the average observed during the recent Netflix Prize competition (e.g., [14,15,18,21]). deviation from µ on ratings provided by user u, and bi is the Our experiments focus on the basic underlying version of the average observed deviation from µ on ratings given to item i. matrix factorization [8]; however, the proposed meta-algorithmic Note that, in all of our experiments (i.e., with all other approach can be applied with any variation of this technique. recommendation algorithms), the ratings data were normalized by removing these global effects. Moreover, this estimate is often 4.3 Results: Comparing Iterative Smoothing used as a baseline recommendation technique for comparison with with Standard Recommendation Techniques other recommendation algorithms, i.e., R*(u,i) = bui, and we The objective of the experiment is to compare the performance of investigate its performance in our experiments as well. the proposed iterative smoothing approach with standard single- User-Based Collaborative Filtering (CF_User). The user-based model recommendation techniques on several real world datasets. nearest-neighbor collaborative filtering approach is a heuristic that The first dataset we used is the Movielens 100K dataset [10], makes predictions of unknown ratings for a user based on the which contains 100,000 known ratings on 1682 movies from 943 ratings previously rated by this user’s “nearest neighbors”, i.e., users (6.3% data density). Our second dataset is a sample other users who have similar rating patterns [6,20]. That is, the extracted from the Movielens 1M dataset. 
The original Movielens value of the unknown rating for user u and item i is usually 1M dataset consists of 1,000,000 ratings for 6040 movies by 3952 computed as an aggregate of the neighbors’ ratings for the same users (4.2% data density) [10]. From this dataset we extracted a item i. The most common aggregation approach is the weighted random sample of 3000 users and 3000 movies. Resulted dataset sum of the neighbors’ ratings, where the similarity of two users is contains 400,627 known ratings (i.e., 4.45% data density). Our used as a weight. I.e., the more similar user u' and target user u third dataset is sampled from the Netflix 100M dataset used in the are, the more weight will be carried by the rating provided by user recent Netflix Prize competition [5]. Similarly to the second u' on item i in the weighted sum when computing the prediction. dataset, we sub-sampled 3000 random users and 3000 random Predicted rating for user u on item i is computed as: movies from the original data file. The result data sample consists ∗ ∑ ∈ , ∗ , of 105,256 known ratings (i.e., 1.17% data density). The three , datasets used in our experiments come from different sources and ∑ ∈ , | | have different data characteristics (i.e., size and sparsity). All where N(u,i) is a set of “neighbors” with similar rating patterns to movie ratings in the Movielens and Netflix datasets are integer user u and that have provided ratings for item i, simuv is the values between 1 and 5, where 1 represents the least liked movies, similarity between users u and v, and bui is the baseline estimate and 5 represents the most liked movies. The datasets used in this for user u on item i. In our implementation, two users must have experiment are summarized in Table 1. rated at least 3 items in common to allow computation of similarity between them. The similarity between two users is Table 1. Summary of Experimental Datasets. 
calculated as Pearson correlation between rating vectors (based on DataSet Description Users Items Density the commonly rated items) of the two users. Prediction of each Movielens Movie ratings from 943 1682 6.30% unknown rating is formulated by combining the preferences of 20 100K Movielens movie most similar users who have rated the same item. Movielens recommender system. 3000 3000 4.45% 1M Item-Based Collaborative Filtering (CF_Item). The user-based Movie ratings collaborative filtering technique also has an analogous item-based Netflix distributed by Netflix 3000 3000 1.17% version, where the ratings of the nearest-neighbor items are used company. to predict unknown ratings for a given item. Several studies have presented empirical evidence that item-based algorithms often The procedure of this experiment followed the general provide better predictive accuracy than user-based methods (e.g., experimental process described in Figure 6. We examined the [22]). Thus, our experiments also test the standard item-based prediction accuracy and stability of the final predictions on the collaborative filtering in conjunction with the proposed approach. reserved validation datasets (as described in Figure 6, Step 2). Similarly to the settings employed in user-based CF, in our For each recommendation algorithm used in our study, we experiments, two items are required to have been rated by 3 compare two approaches: the standard (original) single-model common users to allow similarity evaluation between them, and approach and the scalable version of iterative smoothing 20 nearest-neighbor items are used to formulate a prediction. approach. Accuracy and stability numbers (measured by RMSE and RMSS) of the two approaches on real-world movie rating Matrix factorization (SVD). Matrix factorization technique is a datasets are provided in Table 2. 
Matrix Factorization (SVD). The matrix factorization technique is a model-based (as opposed to heuristic-based) collaborative filtering approach that characterizes items and users via a number of latent factors inferred from known ratings [8,15]. This technique models the U×I rating space as a product of two sub-matrices: a user preference matrix (U×L) and an item feature matrix (L×I). Each user and item is described by a vector of L latent variables; in our experiments L is set to 20. The user vector indicates the preference of the user for the latent features, and the item vector represents the item's importance weights for the same latent features. Singular value decomposition (SVD) techniques are used to decompose the original rating matrix into the two sub-matrices in an optimal way that minimizes the resulting approximation error.

Experimental results are consistent across the different datasets. On all three datasets, our proposed meta-algorithmic iterative smoothing approach outperformed the original recommendation techniques in both stability and accuracy in the vast majority of cases. In particular, on average (i.e., across all datasets), iterative smoothing provided a dramatic 55% improvement over the original recommendation algorithms in stability (as measured by RMSS) for the CF_User and CF_Item algorithms. Even for the fairly stable SVD and baseline techniques, on average, iterative smoothing was able to further improve RMSS by 14%. In terms of predictive accuracy, on average, iterative smoothing provided 1.4% improvements in RMSE across the different algorithms.

Table 2. Iterative Smoothing vs. Standard Techniques.

Dataset        | Method   | Approach  | Accuracy (RMSE) | Stability (RMSS)
Movielens 100K | SVD      | Standard  | 0.9437          | 0.0866
               | SVD      | Smoothing | 0.9378          | 0.0779
               | CF_User  | Standard  | 0.9684          | 0.4023
               | CF_User  | Smoothing | 0.9586          | 0.2020
               | CF_Item  | Standard  | 0.9560          | 0.3234
               | CF_Item  | Smoothing | 0.9320          | 0.1347
               | Baseline | Standard  | 0.9758          | 0.1148
               | Baseline | Smoothing | 0.9609          | 0.0569
Movielens 1M   | SVD      | Standard  | 0.8846          | 0.0804
               | SVD      | Smoothing | 0.8798          | 0.0719
               | CF_User  | Standard  | 0.9393          | 0.3294
               | CF_User  | Smoothing | 0.9182          | 0.1145
               | CF_Item  | Standard  | 0.9135          | 0.2755
               | CF_Item  | Smoothing | 0.8929          | 0.1089
               | Baseline | Standard  | 0.9425          | 0.0910
               | Baseline | Smoothing | 0.9346          | 0.0923
Netflix        | SVD      | Standard  | 0.9372          | 0.1208
               | SVD      | Smoothing | 0.9363          | 0.1137
               | CF_User  | Standard  | 0.9608          | 0.4610
               | CF_User  | Smoothing | 0.9407          | 0.2462
               | CF_Item  | Standard  | 0.9579          | 0.4394
               | CF_Item  | Smoothing | 0.9381          | 0.2291
               | Baseline | Standard  | 0.9622          | 0.1465
               | Baseline | Smoothing | 0.9556          | 0.1351

5. CONCLUSIONS AND FUTURE WORK
This paper introduces a general-purpose, practical meta-algorithmic approach, based on iterative smoothing, for improving the stability of a variety of traditional recommendation algorithms. The iterative smoothing approach uses multiple iterations to repeatedly and explicitly adjust the predictions of a recommendation algorithm based on its other predictions, in order to make them more consistent with each other. We examined the performance of the iterative smoothing approach on several real-world datasets. Our experiments show that the proposed approach is effective at improving stability for several widely used recommendation algorithms. Perhaps as importantly, the proposed approach not only does not sacrifice predictive accuracy to obtain these stability improvements, but is actually able to provide some additional accuracy improvements at the same time.

This work suggests several interesting directions for future research. This study shows that iterative smoothing can improve stability for different recommendation algorithms, providing larger improvements for some algorithms and smaller improvements for others. Developing additional theoretical understanding of which algorithmic and data characteristics lead to larger vs. smaller improvements in recommendation stability for the proposed approach is an important direction for future work. Another interesting direction would be to perform user behavior studies to investigate the impact of stable (as opposed to unstable) recommendations on users' usage patterns and acceptance of recommender systems.

ACKNOWLEDGMENTS
This research was supported in part by the National Science Foundation grant IIS-0546443.

REFERENCES
[1] Adomavicius, G., and Tuzhilin, A. 2005. "Toward the Next Generation of Recommendation Systems: A Survey of the State-of-the-Art and Possible Extensions," IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749.
[2] Adomavicius, G., and Zhang, J. 2010. "On the Stability of Recommendation Algorithms," ACM Conference on Recommender Systems, Barcelona, Spain: ACM, 47-54.
[3] Balabanovic, M., and Shoham, Y. 1997. "Fab: Content-Based, Collaborative Recommendation," Comm. of ACM, 40(3), 66-72.
[4] Bell, R.M., and Koren, Y. 2007. "Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights," Seventh IEEE International Conference on Data Mining, Omaha, NE, USA.
[5] Bennett, J., and Lanning, S. 2007. "The Netflix Prize," KDD-Cup and Workshop, San Jose, CA.
[6] Breese, J.S., Heckerman, D., and Kadie, C. 1998. "Empirical Analysis of Predictive Algorithms for Collaborative Filtering," 14th Conf. on Uncertainty in Artificial Intelligence, Madison, WI.
[7] D'astous, A., and Touil, N. 1999. "Consumer Evaluations of Movies on the Basis of Critics' Judgments," Psychology & Marketing, 16, 677-694.
[8] Funk, S. 2006. "Netflix Update: Try This at Home."
[9] Gershoff, A., Mukherjee, A., and Mukhopadhyay, A. 2003. "Consumer Acceptance of Online Agent Advice: Extremity and Positivity Effects," J. of Consumer Psychology, 13(1&2), 161-170.
[10] Grouplens 2011. "Movielens Data Sets."
[11] Herlocker, J.L., Konstan, J.A., Terveen, K., and Riedl, J.T. 2004. "Evaluating Collaborative Filtering Recommender Systems," ACM Transactions on Information Systems, 22(1), 5-53.
[12] Komiak, S., and Benbasat, I. 2006. "The Effects of Personalization and Familiarity on Trust and Adoption of Recommendation Agents," MIS Quarterly, 30(4), 941-960.
[13] Koren, Y. 2009. "The Bellkor Solution to the Netflix Grand Prize." From http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf.
[14] Koren, Y. 2008. "Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model," ACM Intl. Conf. on Knowledge Discovery and Data Mining (KDD'08), 426-434.
[15] Koren, Y., Bell, R., and Volinsky, C. 2009. "Matrix Factorization Techniques for Recommender Systems," IEEE Computer, 42, 30-37.
[16] Macskassy, S.A., and Provost, F. 2007. "Classification in Networked Data: A Toolkit and a Univariate Case Study," Journal of Machine Learning Research, 8, 935-983.
[17] O'Donovan, J., and Smyth, B. 2005. "Trust in Recommender Systems," 10th Intl. Conf. on Intelligent User Interfaces.
[18] Paterek, A. 2007. "Improving Regularized Singular Value Decomposition for Collaborative Filtering," KDDCup, 39-42.
[19] Pilaszy, I., and Tikk, D. 2009. "Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata," Third ACM Conf. on Recommender Systems (RecSys '09), 93-100.
[20] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. 1994. "Grouplens: An Open Architecture for Collaborative Filtering of Netnews," Conf. on Computer Supported Cooperative Work, 175-186.
[21] Salakhutdinov, R., and Mnih, A. 2008. "Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo," 25th Intl. Conf. on Machine Learning, Helsinki, Finland: ACM, 880-887.
[22] Sarwar, B., Karypis, G., Konstan, J.A., and Riedl, J. 2001. "Item-Based Collaborative Filtering Recommendation Algorithms," 10th International WWW Conference, Hong Kong, 285-295.
[23] Shani, G., and Gunawardana, A. 2011. "Evaluating Recommendation Systems," in Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira and P.B. Kantor (eds.), 257-294.
[24] Van Swol, L.M., and Sniezek, J.A. 2005. "Factors Affecting the Acceptance of Expert Advice," British Journal of Social Psychology, 44(3), 443-461.
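As a closing illustration of the meta-algorithmic structure summarized in the conclusions (repeatedly adjusting a base algorithm's predictions using its other predictions until they become mutually consistent), the loop can be sketched roughly as follows. This is our own simplified rendering, not the authors' exact procedure: it assumes a fixed number of iterations, smooths all unknown cells at once, and uses `fit` as a stand-in for any base recommendation algorithm (CF_User, CF_Item, SVD, or the baseline).

```python
def iterative_smoothing(fit, known, unknown_cells, iters=5):
    """Meta-algorithm sketch: repeatedly re-estimate unknown ratings with a model
    trained on the known ratings plus the current estimates of the unknown ones."""
    model = fit(known)                           # initial model: known ratings only
    preds = {c: model(c) for c in unknown_cells}
    for _ in range(iters):
        augmented = {**known, **preds}           # treat current estimates as pseudo-ratings
        model = fit(augmented)                   # re-train the base algorithm
        preds = {c: model(c) for c in unknown_cells}  # update the estimates
    return preds

def mean_fit(data):
    """Trivial base algorithm for demonstration: always predict the global mean rating."""
    m = sum(data.values()) / len(data)
    return lambda cell: m
```

Here `fit` takes a dict mapping (user, item) cells to ratings and returns a predictor; in practice the number of iterations, or a convergence check on how much the estimates change, would control the loop.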