Data Masking for Recommender Systems: Prediction Performance and Rating Hiding

Manel Slokom, TU Delft, Netherlands, m.slokom@tudelft.nl
Martha Larson, Radboud University and TU Delft, Netherlands, m.larson@cs.ru.nl
Alan Hanjalic, TU Delft, Netherlands, A.Hanjalic@tudelft.nl

ABSTRACT
Data science challenges allow companies, and other data holders, to collaborate with the wider research community. In the area of recommender systems, the potential of such challenges to move forward the state of the art is limited due to concerns about releasing user interaction data. This paper investigates the potential of privacy-preserving data publishing for supporting recommender system challenges. We propose a data masking algorithm, Shuffle-NNN, with two steps: neighborhood selection and value swapping. Neighborhood selection preserves valuable item similarity information. The data shuffling technique hides (i.e., changes) ratings of users for individual items. Our experimental results demonstrate that the relative performance of algorithms, which is the key property that a data science challenge must measure, is comparable between the original data and the data masked with Shuffle-NNN.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Recommender systems, privacy-preserving data publishing, data masking

ACM RecSys 2019 Late-breaking Results, 16th-20th September 2019, Copenhagen, Denmark. Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

INTRODUCTION TO DATA MASKING WITH SHUFFLE-NNN
We propose a masking approach, Shuffling Non-Nearest-Neighbors (Shuffle-NNN), which changes (i.e., hides) a large proportion of the values of the ratings in the original user-item matrix, R, to create a masked data set, R′. Our work is a step towards masked data that can be publicly released for data science challenges. For this reason, we adopt a relative-rank success criterion: given a set of recommender algorithms to be compared, the relative ranking of the algorithms trained and tested on R and on R′ must be maintained. Shuffle-NNN is motivated by the observation that privacy-preserving techniques [13] (privacy-preserving data publishing [3], privacy-preserving data mining [1]) have not yet been systematically applied to the data released for recommender system challenges.

Shuffle-NNN generates a masked data set by changing a large portion of the values of the ratings in a user's profile. We chose a data shuffling technique because of its previous success in masking numerical data in other domains [11]. Shuffle-NNN works as follows. Our overall strategy is to shuffle ratings in a way that maintains key item-item similarities in the data set. First, we determine the item neighborhoods, i.e., the most similar items for every item, and join them, giving us an overall set of critical items. Then, we shuffle ratings for items not in this set (non-critical items, i.e., the rest of the items). Shuffle-NNN has two parameters. We fix the neighborhood size k = 40, since this value has been shown to be effective in practice (http://www.mymedialite.net/examples/datasets.html). Given k, we set θ (a minimum threshold on item similarity that must be met for inclusion in a neighborhood) by running some test rankings.
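For illustration, a minimal Python sketch of the two steps is given below. It is not the implementation used in our experiments; the cosine similarity measure, the zero-for-unrated matrix convention, and the column-wise direction of swapping (which preserves which (user, item) pairs are rated) are illustrative assumptions rather than fixed parts of the method.

```python
import numpy as np

def shuffle_nnn(R, k=40, theta=0.4, seed=0):
    """Sketch of Shuffle-NNN: keep ratings of critical items, shuffle the rest.

    R     : user-item rating matrix, 0 = unrated (illustrative convention).
    k     : neighborhood size (fixed to 40 in the paper).
    theta : minimum item similarity for inclusion in a neighborhood.
    """
    rng = np.random.default_rng(seed)
    n_items = R.shape[1]

    # Step 1: neighborhood selection. Cosine similarity between item columns
    # is an assumption; the similarity measure is not prescribed above.
    norms = np.linalg.norm(R, axis=0) + 1e-12
    sim = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(sim, -np.inf)              # exclude self-similarity

    critical = set()
    for i in range(n_items):
        top_k = np.argsort(sim[i])[::-1][:k]    # k most similar items
        critical.update(j for j in top_k if sim[i, j] >= theta)

    # Step 2: value swapping. For each non-critical item, permute its observed
    # ratings among the users who rated it, so the rated positions (and the
    # item's multiset of rating values) are preserved while individual
    # user-item values change.
    R_masked = R.copy()
    for i in set(range(n_items)) - critical:
        raters = np.nonzero(R[:, i])[0]
        R_masked[raters, i] = rng.permutation(R[raters, i])
    return R_masked
```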
Our exploratory experiments showed that a range of θ values can be effective, and that θ can be determined using a subset of the algorithms to be ranked (meaning that it can be set in-house before releasing data).

Recall that the goal of this paper is not to demonstrate the absolute performance of the algorithms, but rather to evaluate whether the relative performance of algorithms is the same on the original and on the masked data. For this purpose, we need a selection of classic recommender algorithms, ranging from baselines that are known not to yield state-of-the-art performance, to current algorithms. We carry out experiments both on ranking prediction and on classic rating prediction tasks. The algorithms for ranking prediction are: Most Popular; user- and item-based K-Nearest Neighbors (UserKNN, ItemKNN); Pairwise Factorization Machines with BPR (BPRFM) [5]; and BPR Matrix Factorization (BPRMF). We followed the hyperparameters tested in [9]. The algorithms for rating prediction are: ItemKNN [12], UserKNN, SlopeOne [8], Co-Clustering [4], Matrix Factorization (MF), Biased Matrix Factorization (BMF) [7], and SVD++ [6]. We also test baseline algorithms: Average-Item and Average-User, which use the average rating value of an item or user for predictions.

Table 1: Statistics of the data sets used in our experiments and analysis. Values are given as original / random empirical / masked; "-" means identical to the original.

data set                                   #Users      #Items      #Ratings      Range      Av. rating          Density(%)  Variance
MovieLens 100k / Random empirical / Masked 943/-/-     1,682/-/-   100,000/-/-   [1,5]/-/-  3.529/2.997/3.529   6.30/-/-    1.1256/1.415/1.125
GoodBook / Random empirical / Masked       53,424/-/-  10,000/-/-  981,756/-/-   [1,5]/-/-  3.8565/3.858/3.856  0.18/-/-    0.9839/0.9837/0.9838

Table 2: The ranking prediction performance on the original data and on the masked data. For the masked MovieLens data we used θ = 0.4 and for the masked GoodBook data we used θ = 0.15. Rows are ordered by the relative ranking of the algorithms.

          R@5/R@10 (MovieLens)         R@5/R@10 (GoodBook)
          Original      Masked         Original      Masked
UserKNN   0.093/0.177   0.093/0.178    0.434/0.595   0.433/0.594
BPRMF     0.084/0.165   0.085/0.166    0.398/0.575   0.396/0.571
BPRFM     0.082/0.159   0.082/0.159    0.387/0.567   0.382/0.563
ItemKNN   0.079/0.156   0.079/0.155    0.315/0.452   0.314/0.451
Popular   0.058/0.099   0.064/0.108    0.033/0.053   0.028/0.059

EXPERIMENTAL FRAMEWORK
The experiments are implemented using WrapRec [10]. We test on two publicly available data sets (cf. Table 1). We choose MovieLens 100k (https://grouplens.org/datasets/movielens/), since it is well understood, and Goodbooks-10k (https://www.kaggle.com/philippsp/book-recommender-collaborative-filtering-shiny/data), since it is larger and sparser.

COMPARATIVE ALGORITHM RANKING

Ranking Prediction Performance
We train and test the algorithms for ranking prediction both on the original data and on the masked data. Results are given in Table 2. It can be seen that the comparative algorithm ranking is maintained. In other words, for all cases, the best algorithm on the masked data is also the best algorithm on the original data, and the worst algorithm on the masked data is also the worst algorithm on the original data. These results demonstrate the success of Shuffle-NNN. In Table 2, we report results for specific values of θ. However, relative ranking is actually maintained for a range of values of θ (not shown here). In addition to Recall@5 and Recall@10, we found that Precision@5 and Precision@10 also maintain the same relative ranking of algorithms.
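The relative-rank success criterion can also be quantified with a rank correlation. The following sketch is illustrative (Kendall's tau is one possible choice, not one prescribed in the paper); it checks the MovieLens Recall@10 column of Table 2, where tau = 1.0 means the two orderings agree on every pair of algorithms.

```python
from scipy.stats import kendalltau

def ranking_preserved(scores_original, scores_masked):
    """Kendall's tau between algorithm orderings on original vs. masked data.

    scores_*: dict mapping algorithm name -> evaluation score (e.g. Recall@10).
    """
    algos = sorted(scores_original)  # fix a common algorithm order
    tau, _ = kendalltau([scores_original[a] for a in algos],
                        [scores_masked[a] for a in algos])
    return tau

# Recall@10 on MovieLens, from Table 2.
original = {"UserKNN": 0.177, "BPRMF": 0.165, "BPRFM": 0.159,
            "ItemKNN": 0.156, "Popular": 0.099}
masked = {"UserKNN": 0.178, "BPRMF": 0.166, "BPRFM": 0.159,
          "ItemKNN": 0.155, "Popular": 0.108}
print(ranking_preserved(original, masked))  # 1.0: ordering fully preserved
```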
Interestingly, when we tested our algorithms on empirical data (which we generate by replacing individual values with values drawn randomly from the global distribution of ratings in the original data), the comparative ranking was also preserved. This means that Shuffle-NNN is sufficient, but not actually necessary, in the case of Top-N recommendation. Our conclusion from these results is that (at least for these data sets) the values of the ratings are not important if the goal is a relative ranking among algorithms. What is important is that our masked data sets maintain information on which items were rated.

Rating Prediction Performance
To dig deeper, we carry out rating prediction experiments. Results are reported in Table 3. Here, we observe that the relative ranking is well maintained between the original data and the masked data (although not perfectly). Our focus here is on relative performance, but it is interesting to note that the absolute RMSE on the masked data remains within 5% of its value on the original data. In contrast to the ranking-prediction results, we found that the empirical distribution does less well than Shuffle-NNN in maintaining the ranking.

Table 3: The rating prediction performance on the original data and on the masked data. Note: lower RMSE is better. For the masked MovieLens data we used θ = 0.4 and for the masked GoodBook data we used θ = 0.15. Each column lists the algorithms in their relative ranking on that data (best RMSE first).

RMSE for MovieLens data set                    RMSE for GoodBook data set
Original                Masked                 Original                Masked
SVD++ (0.902)           SVD++ (0.957)          BMF (0.822)             BMF (0.850)
BMF (0.911)             BMF (0.962)            SVD++ (0.825)           SVD++ (0.854)
ItemKNN (0.918)         MF (0.968)             ItemKNN (0.826)         ItemKNN (0.861)
MF (0.929)              ItemKNN (0.972)        MF (0.856)              MF (0.873)
UserKNN (0.935)         UserKNN (0.979)        SlopeOne (0.865)        UserKNN (0.894)
SlopeOne (0.937)        SlopeOne (0.986)       UserKNN (0.868)         SlopeOne (0.903)
Co-Clustering (0.974)   Co-Clustering (1.026)  Average-User (0.883)    Average-User (0.913)
Average-Item (1.023)    Average-Item (1.030)   Co-Clustering (0.891)   Co-Clustering (0.924)
Average-User (1.042)    Average-User (1.081)   Average-Item (0.948)    Average-Item (0.947)

RATING HIDING
Next, we discuss the ability of Shuffle-NNN to hide ratings. A rating is considered hidden if its value changes between the masked data and the original data [2]. First, we look at the global proportion of ratings hidden in the masked data. We find that, at our operating point, Shuffle-NNN achieved a rating hiding level of 0.7 (i.e., 70% of ratings changed) for the MovieLens data and 0.68 for the GoodBook data.

Then, we look at the impact of masking at the user level. Figure 1 illustrates the relationship between average user ratings before and after masking. Figure 2 shows, for different levels of hidden ratings, how many users achieved each level. The average user rating of the masked data is impacted, but still correlated with the original values. The protection varies per user, but is relatively high.

Figure 1: Average user ratings, original vs. masked data (masked using θ = 0.4 for MovieLens; θ = 0.15 for GoodBook). [Two scatter panels: "Masked data for theta = 0.4" over MovieLens data; "Masked data for theta = 0.15" over GoodBook data.]
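For concreteness, the hiding levels reported above can be computed as follows. This is a minimal sketch using the same illustrative zero-for-unrated matrix convention as the earlier sketch; the function name hiding_stats is ours.

```python
import numpy as np

def hiding_stats(R, R_masked):
    """Global and per-user hiding levels, following the definition in [2]:
    a rating is hidden if its value differs between original and masked data.
    """
    rated = R != 0                     # positions of observed ratings
    hidden = rated & (R != R_masked)   # observed ratings whose value changed
    global_level = hidden.sum() / rated.sum()
    # Per-user hiding level: fraction of a user's own ratings that changed
    # (the quantity histogrammed in Figure 2).
    per_user = hidden.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)
    return global_level, per_user
```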
For the GoodBook data, a large number of users have a hiding level equal to zero. On average, these users have rated fewer than 30 books.

Figure 2: Number of users at each user-hiding level (masked using θ = 0.4 for MovieLens; θ = 0.15 for GoodBook). [Two histograms: users of MovieLens and users of GoodBook data, by user-hiding level.]

CONCLUSIONS AND OUTLOOK
Our overall conclusion is that data masking has great potential for data science challenges. It is possible to develop a masking approach such that masked data can be used to train and test algorithms with little impact on the relative performance of the algorithms. Shuffle-NNN provides valuable evidence about what information can be removed from the user-item matrix and what information should be maintained. When the critical items list is larger than the list of items that is shuffled, it is easier to reconstruct the original values. Future work will investigate the difficulty of reconstructing the original values from the shuffled data, which is an important issue to consider in cases where the critical items are in the majority. We note that even modest levels of rating hiding can support deniability.

REFERENCES
[1] Rakesh Agrawal and Ramakrishnan Srikant. 2000. Privacy-preserving Data Mining. SIGMOD Rec. 29, 2 (2000), 439–450.
[2] Elisa Bertino, Igor Nai Fovino, and Loredana Parasiliti Provenza. 2005. A framework for evaluating privacy preserving data mining algorithms. Data Mining and Knowledge Discovery 11, 2 (2005), 121–154.
[3] Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu. 2010. Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques. Chapman and Hall/CRC.
[4] T. George and S. Merugu. 2005. A scalable collaborative filtering framework based on co-clustering. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM '05). 625–628.
[5] Weiyu Guo, Shu Wu, Liang Wang, and Tieniu Tan. 2016. Personalized Ranking with Pairwise Factorization Machines. Neurocomputing 214 (2016), 191–200.
[6] Yehuda Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08). ACM.
[7] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
[8] Daniel Lemire and Anna Maclachlan. 2005. Slope One Predictors for Online Rating-Based Collaborative Filtering. In Proceedings of the 2005 SIAM International Conference on Data Mining. SIAM, 471–475.
[9] Babak Loni, Roberto Pagano, Martha Larson, and Alan Hanjalic. 2019. Top-N Recommendation with Multi-Channel Positive Feedback Using Factorization Machines. ACM Trans. Inf. Syst. 37, 2 (2019), 15:1–15:23.
[10] Babak Loni and Alan Said. 2014. WrapRec: An Easy Extension of Recommender System Libraries. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys '14). ACM, 377–378.
[11] Krishnamurty Muralidhar and Rathindra Sarathy. 2006. Data Shuffling: A New Masking Approach for Numerical Data. Management Science 52, 5 (2006), 658–670.
[12] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. In Proceedings of the 10th International Conference on World Wide Web (WWW '01). ACM.
[13] Vicenç Torra. 2017. Masking Methods. Springer International Publishing, Cham, 191–238.