Reproducing and Prototyping Recommender Systems in R Ludovik Çoba, Panagiotis Symeonidis, Markus Zanker Free University of Bozen-Bolzano, 39100, Bozen-Bolzano, Italy {lucoba,psymeonidis,Markus.Zanker}@unibz.it Abstract. In this paper we describe rrecsys, an open source extension package in R for rapid prototyping and intuitive assessment of recom- mender system algorithms. Due to its wide variety of implemented pack- ages and functionalities, R language represents a popular choice for many tasks in Data Analysis. This package replicates the most popular collab- orative filtering algorithms for rating and binary data and we compare results with the Java-based LensKit implementation for the purpose of benchmarking the implementation. Therefore this work can also be seen as a contribution in the context of replication of algorithm implementa- tions and reproduction of evaluation results. Users can easily tune avail- able implementations or develop their own algorithms and assess them according to the standard methodology for offline evaluation. Thus this package should represent an easily accessible environment for research and teaching purposes in the field of recommender systems. 1 Introduction R represents a popular choice in Data Analytics and Machine Learning. The software has low setup cost and contains a large selection of packages and func- tionalities to enhance and prototype algorithms with compact code and good visualization tools. Thus R represents a suitable environment for exploring the field of recommender systems. Therefore we present and contribute a novel R package that reproduces several of the most popular recommender algorithms for Likert scaled as well as binary rating values. The functionality of this framework is mainly focused on prototyping and educational purposes due to the compact code representation and the interactive way of invoking and visualizing results in R. We took advantage of the large selection of packages in R to implement the algorithms included in our package. We achieve competing performance due to highly vectorized and mixed R/C++ implementations. We therefore introduce rrecsys[1, 2] by presenting a general overview of the implemented algorithms and the evaluation methodology. A few code examples are included in an appendix. Furthermore, we proceed by comparing results of rrecsys with those of the Lenskit library. We also include evidence on runtime performance that document the efficient implementation of algorithms in R. 2 rrecsys Package rrecsys has a modular structure as well as includes expansion capabilities. The core of the package includes the implementation of several popular algorithms, an evaluation component and a couple of auxiliary parts for data analysis and convergence detection. Next we concisely describe the included algorithms. 2.1 Algorithms Baseline and Popularity: The included baseline predictors are the global mean rating (Global Average), item’s mean rating (Item Average), user’s mean ratings (User Average) as well as an unpersonalized Most Popular method that determines item popularity based on the total number of (positive) ratings. Item Based K-nearest neighbors: Given a target user and her positively rated items, the algorithm will identify the k-most similar items for each target a and will rank them according to aggregated similarities with the different targets as described by Sarwar et al. [14]. For similarity measures we develop the cosine similarity and the adjusted cosine similarity. Items a and b are considered as rating vectors a and b in the user space. Cosine similarity measures the cosine angle between those vectors: a·b sim(a, b) = cos(a, b) = (1) |a| ∗ |b| The adjusted cosine similarity is computed by offsetting the user average on each co-rated pair on two item vectors. If Ia and Ib is the set of users that rated correspondingly item a and b the adjusted cosine similarity is measured as: P (Rua − Ru ) ∗ (Rub − Ru ) u∈Ia ∩Ib sim(a, b) = r P P (2) (Rua − Ru )2 ∗ (Rub − Ru )2 u∈Ia ∩Ib u∈Ia ∩Ib Where Ru is the average rating of user u, computed as: X Rui Ru = (3) i∈Iu |Iu | Where Iu is the set of items rated by user u. Once similarities among all items are computed a neighborhood might be formed by choosing the items with the highest similarity value. Prediction over a target user u and item a are calculated as the weighted sum [14]. P (sa,k ∗ Ru,k ) all similar items,k pu,a = P (4) sa,k all similar items,k User Based K-nearest neighbors: Herlocker et al. [7] proposed the al- gorithm that finds similarities among users instead among items. In our imple- mentation we consider the cosine similarity and the Pearson correlation. User u and v are considered as rating vectors u and v in the item space. Co- sine similarity measures the cosine angle between those vectors using Formula 1 Instead the Pearson correlation is measured by offsetting the user average on co-rated pairs among the user vectors: P (Rui − Ru ) ∗ (Rvi − Rv ) i∈Iu ∩Iv sim(u, v) = P earson(u, v) = r P P (5) (Rui − Ru )2 ∗ (Rvi − Rv )2 i∈Iu ∩Iv i∈Iu ∩Iv Where Ru and Rv are correspondingly the average ratings of user u and user v, computed like in Formula 3. Once similarities among all items are computed a neighborhood might be formed by choosing the items with the highest similarity value. Prediction over a target user u and item a are calculated as the weighted sum [14]: P (sa,k ∗ Ru,k ) all similar items,k pu,a = Ru + P (6) sa,k all similar items,k Weighted Slope One: proposed by Lemire et al [10] performs prediction for a missing rating r̂ui for user u on item i as the following average: P ∀ruj (devij + ruj )cij r̂ui = P . (7) ∀ruj cij The average deviation rating devij between co-rated items is defined by: X rui − ruj devij = . (8) cij ∀u∈users Where cij is the number of co-rated items between items i and j and rui is an existing rating for user u on item i. The Weighted Slope One takes into account both, information from users who rated the same item and the number of observed ratings. Simon Funk’s SVD: Matrix factorization methods are used in recom- mender systems to derive a set of latent factors, from the user × item rating matrix, to characterize both users and items by this vector of factors. The user- item interaction are modeled as inner product of the latent factors space [4]. Accordingly each item i will be associated with a vector of factors Vi , and each user u is associated with a vector of factors Uu . An approximation of the rating of a user u on an item i can be derived as the inner product of their factor vectors: R̂ui = µ + bi + bu + Uu ∗ ViT (9) Where µ is the overall average rating and bu and bi indicate the deviation due to user u and item i from the mean rating. The U(user) and V(item) factor matrices are cropped to k features and ini- tialized at small values. Each feature is trained until convergence (where con- vergence specifying the number of updates to be computed on a feature before considering it converged, it can be either chosen by the user or calculated auto- matically by the package). On each loop the algorithm predicts R̂ui , calculates the error and the factors are updated as follows: eui = Rui − R̂ui (10) Vik ← Vik + λ ∗ (eui ∗ Uuk − γ ∗ Vik ) (11) Uuk ← Uuk + λ ∗ (eui ∗ Vik − γ ∗ Uuk ) (12) The attribute λ represents the learning rate, while γ corresponds to the regular- ization term. In addition, the following two algorithms address the One Class Collaborative Filtering problem (OCCF). Bayesian Personalized Ranking: The algorithm has been introduced by [12]. It turns the OCCF into a ranking problem by implicitly assuming that users prefer items they have already interacted with another time. Instead of applying rating prediction techniques, BPR ranks candidate items for a user without calculating a ”virtual” rating. The overall goal of the algorithm is to find a personalized total ranking >u ⊂ I 2 for any user u ∈ U sers and pairs of items (i, j) ∈ I 2 that meet the properties of a total order (totality, anti-symmetry, transitivity). Weighted Alternated Least Squares: We compute a low-rank approxi- mation matrix R̂ = (R̂ij )m×n = U ∗ V T , where U and V are the usual feature matrix cropped to k features as introduced by [11]. Weighted low-rank aims to determine R̂ such that minimizes the Frobenius loss of the following objective function: X L(R̂) = L(U, V ) = Wij (Rij − Ui ∗ VjT )2 + λ ∗ (kUi k2F + kVi k2F ) (13) ij The regularization term weighted by λ is added to prevent over-fitting. The expression k.kF denotes the Frobenius norm. The alternated least square algo- rithms optimization process solves partial derivatives of L with respect to each entry U and V , ∂L(U,V ∂Ui ) = 0 with fixed V and ∂L(U,V ∂Vj ) = 0 with fixed U , to com- pute Ui and Vi . Then U and V are initialized with random Gaussian numbers with mean zero and small standard deviation and are updated until convergence. m×n The matrix W = (Wij )m×n ∈ R+ is a non-negative weight matrix that assigns confidence values to observations (hence the name weighted ALS). 2.2 Recommendation List The library is currently distributing two different methodologies for computing the top-N recommendation. The first is the Highest Predicted Ratings (HPR), which proposes a sorted list based on the highest computed rating values by an algorithm. The second method is the Most Frequent (MF), that determines the top-N list based on the most frequent items available in the neighborhood of an user or item. This methodology is known to produce better performance than HPR [9, 15]. 2.3 Evaluation The evaluation module is based on the k-fold cross-validation method. A strati- fied random selection procedure is applied when dividing the rated items of each user into k folds such that each user is uniformly represented in each fold, i.e. the number of ratings of each user in any fold differs at most by one. For k-fold cross validation each of the k disjunct fractions of the ratings are used k − 1 times for training (i.e. ⊂ Rtrain ) and once for testing (i.e. ⊂ Rtest ). Practically, ratings in Rtest are set as missing in the original dataset and predictions/recommendations are compared to Rtest to compute the performance measures. We included the most popular performance metrics according to the survey in [8]. These are mean absolute error(MAE), root mean squared error(RMSE), Precision, Recall, F1, True and False Positives, True and False Negatives, nor- malized discounted cumulative gain (NDCG), rank score, area under the ROC curve (AUC) and catalog coverage[5]. 3 Experimental results In this Section, we compare our rrecsys library with the popular Lenskit [3] Java library. In Figure 1, we compare both libraries in terms of RMSE and MAE, using 5-fold cross validation. Lenskit and rrecsys were configured both with the same algorithms and evaluation methodology. We used the MovieLens100K dataset [6] for these experiments as the purpose is not only the results per se, but to demonstrate the reproducibility of the results derived from LensKit. For the experiments in Figure 1 on the FunkSVD, we have tuned its pa- rameters as follows, the latent space is set to 100 features, the learning rate is set to 0.001, the regularization term is set to 0.015. In the case of the item based k-nearest neighbor algorithm, we have set the number of nearest neighbors to 100, and adjusted cosine similarity is the similarity measure. In the case of the user-based k-nearest neighbor algorithm, we have set the number of nearest neighbors to 100, and used Pearson as similarity measure. As shown in Figure 1, all the reported results demonstrate our ability to clearly reproduce and replicate the same results with those of Lenskit in terms of both RMSE and MAE while other libraries failed to do so [13]. The insignificant differences with LensKit are the result of a random distribution of items in the k-folds. Since BPR and wALS are not implemented in LensKit it is impossible for us to compare results. In Table 1 we show the performance of rrecsys based on the latest imple- mentations with R/C++ code on the MovieLens100K dataset running on the same machine. It is noticeable that rrecsys performs similarly to LensKit. We compared only optimized algorithms. In future we will provide optimized imple- mentation of more state of the art algorithms. RMSE rrecsys RMSE Lenskit MAE rrecsys MAE LensKit 1.5 1.13 1.13 1.05 1.04 1.04 1.02 0.95 0.95 0.94 0.94 0.94 0.94 0.93 0.92 0.92 0.92 0.84 0.84 0.83 0.82 1 0.75 0.74 0.74 0.73 0.73 0.72 0.72 0.72 0.5 ean ean ean ne D ase d ase d lM M rM eO SV ba tem p nk ser B B glo i use Slo Fu U Item W Fig. 1. Benachmark with LensKit. IBKNN (neigh.) UBKNN (neigh.) Funk SVD (feat.) WSlopeOne Framew. 10 100 500 10 100 500 10 100 500 rrecsys 14.91s 14.99s 15.67s 8.77s 8.87s 9.48s 6.23s 15.34s 54.71s 49.46s Lenskit 14.17s 14.62s 15.28s 60.04s 61.48s 62.32s 4.32s 11.05s 46.82s 16.41s Table 1. Evaluation times with optimized implementations using R/C++ and com- parison with LensKit. 4 Using the library In this Section, we introduce an executable script in R for running some of the functionalities of rrecsys in order to demonstrate its intuitive use. Please notice that due to space limitations in this paper we do not describe all commands in detail. The library is distributed with a full range of vignettes and a manual describing all available functionalities1 . The package is loaded on the Comprehensive R Archive Network (CRAN), therefore download, installation and loading of the package requires the execu- tion of the following two functions: install.packages("rrecsys"); library(rrecsys) Once the package is loaded the dataset MovieLens Latest2 will be available within the environment. A setup of the data is required to define possible limits and the structure of the dataset. Users can explore the dataset by checking the 1 https://cran.r-project.org/package=rrecsys 2 We redistribute the MovieLens Latest datasets for demonstration purposes only. Please notice that these datasets change over time and are not appropriate for re- porting experimental results. number of ratings, its sparsity or even by modifying it to contain a specific number of ratings for each item\user. data("mlLatest100k") m <- defineData(mlLatest100k, minimum = 1, maximum = 5, halfStar = TRUE) sparsity(m); numRatings(m); rowRatings(m); colRatings(m) #Crop the dataset to contain at least 200 ratings on each user and 10 ratings on each item. smallmlLatest <- m[rowRatings(m) >= 200, colRatings(m) > 10] The following code shows how to train a model (e.g., ub10) on an algorithm (e.g., UBKNN), which can be used for either rating prediction (e.g., p) or item recommendation (e.g., rHPR and rMF). ub10 <- rrecsys(smallmlLatest, "UBKNN", neigh = 10, simFunct = 1) p <- predict(ub10) rHPR <- recommendHPR(ub10, topN = 10) #pt is the positive threshold for recommending an item. rMF <- recommendMF(ub10, topN = 10, pt = 3) The following code shows how we generate the k-folds. Same fold distribution can be used to evaluate different algorithms. folds <- evalModel(smallmlLatest, folds = 2) #Recommendation evaluation. evalRec(folds, "UBKNN", topN = 10, goodRating = 3, simFunct = 2, recAlg = 1) The output of the evaluation function looks like as follows: #Prediction evaluation. > evalPred(folds, "funksvd", k = 10) #Output: MAE RMSE globalMAE globalRMSE Time 1-fold 0.8565404 1.056737 0.9175207 1.161164 2.419723 2-fold 0.8473669 1.043400 0.8959667 1.131292 2.669417 Average 0.8519536 1.050069 0.9067437 1.146228 2.544570 5 Conclusions This paper contributed a recently released package for prototyping and inter- actively demonstrating recommendation algorithms in R. It comes with a nice range of implemented standard algorithms for Likert scaled and binary rat- ings. Reported results demonstrate that it reproduces results of the Java-based Lenskit toolkit. Thus it remains to hope that this effort will be of use for the field of recommender systems and the large R user community. References [1] Çoba, L., Zanker, M.: rrecsys: an r-package for prototyping recommendation algo- rithms. In: Guy, I., Sharma, A. (eds.) Poster Track of the 10th ACM Conference on Recommender Systems (RecSys 2016) (RecSysPosters). No. 1688 in CEUR Workshop Proceedings, Aachen (2016), http://ceur-ws.org/Vol-1688/#paper-12 [2] Çoba, L., Zanker, M.: Replication and reproduction in recommender systems re- search evidence from a case-study with the rrecsys library. In: 30th International Conference on Industrial Engineering and Other Applications of Applied Intelli- gent Systems, IEA/AIE 2017, Arras, France, June, 2017, Proceedings. Springer International Publishing, Cham (2017) [3] Ekstrand, M.D., Ludwig, M., Kolb, J., Riedl, J.T.: Lenskit: A modular recom- mender framework. In: Proceedings of the Fifth ACM Conference on Recom- mender Systems. pp. 349–350. RecSys ’11, ACM, New York, NY, USA (2011), http://doi.acm.org/10.1145/2043932.2044001 [4] Funk, S.: Netflix Update: Try this at Home (2006), http://sifter.org/ si- mon/journal/20061211.html [5] Gunawardana, A., Shani, G.: A Survey of Accuracy Evaluation Metrics of Rec- ommendation Tasks. The Journal of Machine Learning Research 10, 2935–2962 (2009) [6] Harper, F.M., Konstan, J.A.: The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst. 5(4), 19:1–19:19 (Dec 2015) [7] Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In: Proceedings of the 22Nd Annual Inter- national ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 230–237. SIGIR ’99, ACM, New York, NY, USA (1999) [8] Jannach, D., Zanker, M., Ge, M., Gröning, M.: 13th International Conference on E-Commerce and Web Technologies, chap. Recommender Systems in Computer Science and Information Systems – A Landscape of Research, pp. 76–87. Springer Berlin Heidelberg, Berlin, Heidelberg (2012) [9] Karypis, G.: Evaluation of item-based top-n recommendation algorithms. In: Pro- ceedings of the Tenth International Conference on Information and Knowledge Management. pp. 247–254. CIKM ’01, ACM, New York, NY, USA (2001) [10] Lemire, D., Maclachlan, A.: Slope one predictors for online rating-based collabo- rative filtering. In: SDM. vol. 5, pp. 1–5. SIAM (2005) [11] Pan, R., Zhou, Y., Cao, B., Liu, N.N., Lukose, R., Scholz, M., Yang, Q.: One-class collaborative filtering. Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on pp. 502–511 (2008) [12] Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-thieme, L.: BPR : Bayesian Personalized Ranking from Implicit Feedback. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence cs.LG, 452–461 (2009), http://dl.acm.org/citation.cfm?id=1795167 [13] Said, A., Bellogı́n, A.: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks. RecSys pp. 129–136 (2014), http://dx.doi.org/10.1145/2645710.2645746 [14] Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: 10th Int. Conference on the World Wide Web. pp. 285–295 (2001) [15] Symeonidis, P., Nanopoulos, A., Papadopoulos, A.N., Manolopoulos, Y.: Collabo- rative recommender systems: Combining effectiveness and efficiency. Expert Syst. Appl. 34(4), 2995–3013 (May 2008)