BlurM(or)e: Revisiting Gender Obfuscation in the User-Item Matrix∗

Christopher Strucks, Radboud University, Netherlands, chr.strucks@gmail.com
Manel Slokom, TU Delft, Netherlands, m.slokom@tudelft.nl
Martha Larson, Radboud University and TU Delft, Netherlands, m.larson@cs.ru.nl

RMSE workshop at RecSys'19, September 2019, Copenhagen, Denmark

ABSTRACT
Past research has demonstrated that removing implicit gender information from the user-item matrix does not result in substantial performance losses. Such results point towards promising solutions for protecting users' privacy without compromising prediction performance, which are of particular interest in multistakeholder environments. Here, we investigate BlurMe, a gender obfuscation technique that has been shown to block classifiers from inferring binary gender from users' profiles. We first point out a serious shortcoming of BlurMe: simple data visualizations can reveal that BlurMe has been applied to a data set, including which items have been impacted. We then propose an extension to BlurMe, called BlurM(or)e, that addresses this issue. We reproduce the original BlurMe experiments with the MovieLens data set, and point out the relative advantages of BlurM(or)e.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Recommender Systems, Privacy, Data Obfuscation

1 INTRODUCTION
When users rate, or otherwise interact with, items, they may be aware that they are providing a recommender system with preference information. It is less likely, however, that users know that interaction information can implicitly hold sensitive personal information. In this paper, we focus on the problem of binary gender information in the user-item matrix, which can be inferred by using a gender classifier. The state of the art in gender obfuscation for recommender system data is, to our knowledge, represented by Weinsberg et al. [11], who propose a gender obfuscation approach for a user-item matrix of movie ratings, called BlurMe. Successful obfuscation means that a user's gender cannot be correctly inferred by a classifier that has been previously trained on other users' rating data. BlurMe accomplishes this obfuscation without a substantial impact on the prediction performance of the recommender system that is trained on the obfuscated data. Our study of BlurMe has revealed that it has a serious shortcoming. In this paper, we discuss this issue, and propose an extension to BlurMe, called BlurM(or)e, that addresses it. We test BlurM(or)e against a reimplementation of BlurMe, reproducing experiments from [11].

Obfuscation is an important tool for maintaining user privacy, alongside other tools such as encryption. Obfuscation is widely studied in other areas, but does not receive a great amount of attention in the area of recommender systems; exceptions are [2, 8]. Obfuscation can be added to the user-item matrix by users themselves, freeing them from an absolute dependency on the service provider to secure their data and use it properly. In [1, 2], the user can decide what data to reveal and how much protection is put on the data. Even trusted service providers can have issues, such as breaches, or data being acquired and used inappropriately [7].

The main contributions of this paper are:
• A discussion of a flaw we discovered in BlurMe.
• An extension to BlurMe, called BlurM(or)e, that addresses this issue.
• A set of experiments, whose results demonstrate the ability of BlurM(or)e to obfuscate binary gender in the user-item matrix with minimal impact on recommendation performance.

The paper is organized as follows. In Section 2, we cover the related work, before going on to present the shortcoming of BlurMe and our proposed improvement, BlurM(or)e, in Section 3. Next, we present our experiments and results in Section 4, and in Section 5 we discuss our reproduction of BlurMe.¹ We finish in Section 6 with a discussion and conclusion.

2 BACKGROUND AND RELATED WORK
In this section, we discuss the work most closely related to our own.

2.1 Obfuscating the User-Item Matrix
In order to protect user demographic information in the user-item matrix, researchers have suggested data obfuscation. Data obfuscation (a.k.a. data masking) describes the process of hiding the original, possibly sensitive data with modified or even fictional data [10]. The goal is to protect the privacy of users, while maintaining the utility of the data. Data obfuscation can be done in several ways: e.g., [9] used lexical substitution as an obfuscation mechanism for text, and [5] used user groups instead of individual users to hide personal information from the recommender system. In BlurMe [11], the authors found that it is possible to infer the gender of users from their rating histories via basic machine learning classifiers. They proposed an algorithm, BlurMe, which successfully obfuscates the gender of a user, thereby blocking gender inference. BlurMe basically adds ratings to every user profile that are typical for the opposite gender, and is currently the state of the art. The best performing BlurMe obfuscation strategy, the greedy strategy, decreases the accuracy of a logistic regression inference model from 80.2% on the original data to 2.5% on the obfuscated data (adding 10% extra ratings).

∗ Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the RMSE workshop held in conjunction with the 13th ACM Conference on Recommender Systems (RecSys), 2019, in Copenhagen, Denmark.
¹ The code for the reproduction as well as for BlurM(or)e and the exploratory analysis that we carried out is available at https://github.com/STrucks/BlurMore
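For illustration, the greedy obfuscation idea described above can be sketched in a few lines of Python. This is a simplified sketch, not the authors' released code; the movie IDs, list ordering, and average ratings in the example are made-up toy values.

```python
# Minimal sketch of BlurMe-style greedy obfuscation (illustration only;
# the movie IDs, coefficients, and ratings below are toy values).

def greedy_obfuscate(profile, opposite_list, extra_pct, avg_rating):
    """Add extra ratings (extra_pct of profile size) for the most
    gender-indicative unrated movies from the opposite-gender list.

    profile:       dict movie_id -> rating (the user's real ratings)
    opposite_list: movie ids ordered most-indicative-first; the greedy
                   strategy always picks from the top of this list
    avg_rating:    dict movie_id -> rounded average rating, used as the
                   fictive rating value
    """
    k = max(1, round(len(profile) * extra_pct))   # number of ratings to add
    obfuscated = dict(profile)
    for movie in opposite_list:
        if k == 0:
            break
        if movie not in obfuscated:               # only add unrated movies
            obfuscated[movie] = avg_rating[movie]
            k -= 1
    return obfuscated

# Toy example: a user with 10 ratings, obfuscated at 10% extra ratings.
profile = {m: 4 for m in range(10)}
opposite = [42, 7, 99]        # movie 7 is already rated, so it is skipped
averages = {42: 4, 7: 3, 99: 5}
result = greedy_obfuscate(profile, opposite, 0.10, averages)
print(sorted(set(result) - set(profile)))  # -> [42]
```

In the real algorithm, the opposite-gender list is derived from logistic regression coefficients and the fictive rating is the rounded item average; those details are covered in Section 5.2.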
The other proposed strategies have a smaller impact on the classification accuracy. For this reason, in this work, we focus on, and extend, the greedy strategy. Details and further explanation of the gender inference and obfuscation process can be found in Section 5.

2.2 Inference on the User-Item Matrix
The goal of BlurMe obfuscation is to protect against gender inference. The BlurMe [11] authors use basic machine learning models that can successfully infer users' gender from the user-item matrix. The most recent work on inference on the user-item matrix is, to our knowledge, that of [6], who developed a deep retentive learning framework that beats the conventional, standard machine learning approaches in the task of inferring user demographic information. For gender inference, [6] achieves a classification accuracy of 82%. However, this is only 2% better than the standard logistic regression model used in [11]. We adopt the model from [11] here, since it is sufficiently close to the state of the art for our purposes.

Figure 1: #ratings per movie for the movies 15 to 35. The red bar indicates an example of obvious data obfuscation after BlurMe is applied. The BlurMe data was created with the greedy strategy and with 10% extra ratings. The BlurM(or)e data also contains 10% extra ratings.

3 BUILDING A BETTER BLURME
3.1 The Issue with BlurMe
BlurMe [11] proposes a powerful algorithm that can obfuscate the gender of a user. However, BlurMe has an important flaw: if the rating frequencies of the movies are visualized, it is possible to determine that BlurMe has been applied to the data set, and to identify the movies for which ratings have been added. In Figure 1(A), the rating frequency is shown for 20 items from the MovieLens data set before obfuscation. In Figure 1(B), the rating frequency is shown for the same 20 items after obfuscation with BlurMe. BlurMe exhibits sharp spikes for certain items; here, it is item ID 27 (the movie Persuasion), which is marked in red. These spikes indicate that BlurMe has been applied, and point to the movies for which ratings have been added.

There are two dangers associated with these spikes. First, if BlurMe is running at an operating point of 10% extra ratings using the greedy strategy, as mentioned above, then the gender inference accuracy is 2.5%. This means that if it is known that BlurMe has been applied, it is simple to reverse the decision of the classifier, and gender can be known with an accuracy of 97.5%. Second, if we do not know the operating point of BlurMe (<10% extra ratings will not guarantee a gender classification accuracy that we can reverse), we can still find the spikes in the rating histogram, and attempt to reverse BlurMe. In order to find a BlurMe spike, we would look for movies that are known not to be particularly popular, but that still have a lot of ratings in the BlurMe data. In this paper, we focus on addressing the first danger, and leave the second to future work.

3.2 The Definition of BlurM(or)e
BlurM(or)e was inspired by an exploratory analysis that we carried out, which revealed that a large number of movies are indicative of a gender. For this reason, it is not necessary to restrict the algorithm to adding ratings only to the most correlated movies (like the greedy strategy of BlurMe does). This means that we can mask the data without heavily relying on a small set of movies indicative of gender.

Based on these insights, we designed BlurM(or)e, which works as follows: We create, just like BlurMe, two lists of movies, Lf and Lm, that correlate most strongly with females and males, respectively. After that, we alter every user profile by adding movies from the opposite-gender list with the greedy strategy proposed in BlurMe [11]. However, if a movie has already doubled its initial rating count, it is removed from the list. (We use ×2, i.e., doubling, in this paper because it works well, and leave the exploration of other possible values to future work.) Also, we keep track of the number of added ratings, so that we can remove the same number later on. After every user has received extra ratings up to a fixed percentage of their original ratings, we remove ratings from users that have rated a lot of movies (here we choose ⩾ 200 movies, although future work could investigate other values). The idea is that these users already provide enough data for the gender classifier, so removing some of their ratings would not impact the classifier. This idea is also supported by our exploratory analysis, which revealed that the gender classifier does not benefit from additional data once a user has already provided 200 ratings. This removal would be more difficult to diagnose in the user-item matrix, since exact information about users' rating rates would need to be available.

4 EXPERIMENTS AND MAIN RESULTS
4.1 Data
This study uses the publicly available MovieLens data set.² We chose MovieLens 1M, which is also used by BlurMe [11], whose work we are reproducing and extending. MovieLens 1M contains 3.7K movies and about 1M ratings by 6K different users, and also information on binary user gender. It is important to note that the distribution in the data set is unbalanced: there are 4331 males, who produced 750K ratings, and 1709 females, who produced 250K ratings. Statistics of the original and the obfuscated data sets are summarized in Table 1. We note that the number of items decreases for the BlurM(or)e data sets due to the fact that the algorithm might remove all ratings of a certain movie by accident.

² https://grouplens.org/datasets/movielens/

Table 1: Statistics of the data sets used in our experiments and analysis.
data set                      #Users  #Items  #Ratings   Range  Av.rating  Density(%)  Variance
MovieLens 1M                  6040    3706    1,000,209  [1,5]  3.58       4.47        1.25
BlurMe 1% extra ratings       6040    3706    1,013,416  [1,5]  3.58       4.53        1.25
BlurMe 5% extra ratings       6040    3706    1,052,886  [1,5]  3.58       4.70        1.20
BlurMe 10% extra ratings      6040    3706    1,099,545  [1,5]  3.57       4.91        1.16
BlurM(or)e 1% extra ratings   6040    3705    1,000,797  [1,5]  3.57       4.47        1.24
BlurM(or)e 5% extra ratings   6040    3700    1,000,773  [1,5]  3.55       4.48        1.22
BlurM(or)e 10% extra ratings  6040    3699    1,000,395  [1,5]  3.57       4.48        1.16

4.2 Comparison of BlurMe and BlurM(or)e
We compare the performance of our new obfuscation mechanism, BlurM(or)e, with the original obfuscation mechanism, BlurMe. The performance is measured, in line with the experiments in BlurMe, by the classification accuracy of a logistic regression model that is trained on unaltered data and tested on obfuscated data. The performance is cross-validated using 10-fold cross-validation. Table 2 shows that BlurM(or)e performs similarly to BlurMe. The more obfuscation is applied to the data set, the lower the classification accuracy is. Note that Table 2 contains the reproduction of BlurMe that is discussed in detail in Section 5.

Table 2: Gender inference results measured in accuracy on BlurMe (reproduction) and BlurM(or)e.

                                          Extra ratings
Classifier           Data set       0%    1%    5%    10%
Logistic Regression  BlurMe         0.76  0.54  0.15  0.02
Logistic Regression  BlurM(or)e     0.76  0.64  0.36  0.19
Random Classifier    Original       0.50  0.50  0.50  0.50

A big advantage of BlurM(or)e is that an attacker cannot easily see that the data set is obfuscated. Figure 1 shows the number of ratings per movie for 20 different movies in the MovieLens 1M data set. The red bar corresponds to the number of ratings for the movie with ID 27. After the BlurMe obfuscation is applied, the red bar spans approximately ten times its original size. This makes the attacker suspicious and indicates that the data set is obfuscated. However, if the BlurM(or)e obfuscation is applied, the red bar only doubles its size, which is less noticeable. Also, BlurM(or)e has statistics more similar to the original data. Table 1 shows that BlurM(or)e keeps the number of interactions as well as the density similar to the original MovieLens data set, while BlurMe produces a denser data set with more interactions.

The reduction part of BlurM(or)e has a less noticeable effect on the data set. Since the ratings are removed randomly from users with an extreme number of ratings, the distribution of the number of ratings per movie does not change dramatically (the bar with ID 20 shrinks by ≈ 10% of its original size in the BlurM(or)e data set).

4.3 Recommendation Performance
Using a well-known collaborative filtering technique, Matrix Factorization [4], as in BlurMe, we notice that the change in RMSE is not substantial. The change has a maximum of 0.0298 for MovieLens with BlurM(or)e and 0.0381 for BlurMe (with the greedy strategy and 10% extra ratings). We can see in Table 3 that the RMSE decreases as obfuscation increases. BlurMe [11] discovered the same effect and explained that this might be due to the density of the obfuscated data. Since BlurM(or)e does not increase the overall density of the data, an alternative explanation can be found. The reason lies, perhaps, in increasing the density of users with few ratings.

Table 3: The RMSE performance with Matrix Factorization on the original data, BlurMe data, and BlurM(or)e data.

                     Extra ratings
Obfuscation    0%      1%      5%      10%
Original       0.8766  —       —       —
BlurMe         0.8766  0.8686  0.8553  0.8385
BlurM(or)e     0.8766  0.8711  0.8640  0.8468

5 BLURME REPRODUCTION IN DETAIL
Since we did not have the code of the original BlurMe [11], we reimplemented it in order to carry out the comparison in this paper. Because the paper was not specific about the settings of all parameters, it is not possible to create an exact replication. For completeness, we discuss our reimplementation here, so that authors building on our work have the complete details.

5.1 Gender Inference
This section describes our reimplementation of the gender inference models. We create the user-item matrix by associating every user with a vector of ratings: x_i, with i being the index of the user and x_{i,j} being the rating of user i for movie j. If the movie is not rated, we set x_{i,j} = 0. This results in a U × I matrix, where U is the number of users and I is the number of items. Every user vector is associated with a gender, which serves as the target label for the classifier.

Following the experiments of [11], all classifiers are trained and tested on this user-item matrix with 10-fold cross-validation. We do not have information about the splits that were used, so we use our own splits. The ROC area under the curve as well as precision and recall are reported as performance measures. A comparison of the results can be seen in Table 4. The SVM uses a linear kernel and a C value of 1. For the Bernoulli classifier, the user-item matrix is transformed so that every rating x_{i,j} that is greater than 0 is set to 1. This means that the Bernoulli Bayes classifier ignores the value of the rating and only uses information about whether a user i rated movie j or not. All remaining parameters for the other classifiers are set to their default values.

There is about a 4% difference between the scores reported in the original BlurMe paper [11] and those we measured with our reproduction.
Further exploration revealed that normalization, in terms of scaling all ratings from values in [0, 5] to values in [0, 1], can have a large impact on scores. We do not focus on normalization further here, but point out its impact because it suggests that there are parameters that could have been adjusted that are not explicitly recorded in [11]. In this paper, we have chosen to focus on the logistic regression model, since it is the fastest and achieves the best performance.

Table 4: Gender inference results for both BlurMe and the reproduction thereof. The performance is measured in ROC AUC, precision, and recall.

                     BlurMe results     Reproduction results
Classifier           AUC   P/R          AUC   P/R
Bernoulli            0.81  0.79/0.76    0.77  0.88/0.48
Multinomial          0.84  0.80/0.76    0.81  0.89/0.77
SVM                  0.86  0.78/0.77    0.79  0.83/0.82
Logistic Regression  0.85  0.80/0.80    0.81  0.84/0.83

Note that Table 4 uses ROC AUC as the performance metric, while Table 2 uses classification accuracy. This choice was made by BlurMe, and for the sake of comparing the models, we did the same.

5.2 Gender Obfuscation
This section describes our reimplementation of the obfuscation approach of BlurMe [11]. Recall that the basic idea of BlurMe is to add fictional ratings to every user that are atypical for their gender. BlurMe [11] creates two lists, Lf and Lm, of atypical movies for each gender by training and cross-validating a logistic regression model on the training set. The movies in Lf and Lm are ranked according to their average rank across the folds. The rank of a movie within a fold is determined by its coefficient, which is learned by the logistic regression model. The lists Lf and Lm also include the average coefficient over all folds for each movie, which serves as a correlation metric between the movie and the user's gender.

After these lists are created, BlurMe takes every user profile and adds k fictive ratings to the profile for movies from the opposite-gender list. The parameter k limits the number of extra ratings and is set to 1%, 5%, or 10% in the original experiments. A male user with 100 ratings in the original data set would be obfuscated by adding 5 (for k = 5%) fictive ratings from the female list.

There are some design choices left: which movies should be selected from the lists, and what should the fictive rating be? The authors of BlurMe [11] proposed three different selection strategies for the first problem: the Random Strategy, the Sampled Strategy, and the Greedy Strategy. The Random Strategy chooses k movies uniformly at random from the list; the Sampled Strategy chooses k movies randomly, but in line with the score distribution of the movies, so that a movie with a high coefficient is more likely to be added. Finally, the Greedy Strategy chooses the movie with the highest score. The authors do not mention the length of the lists; thus, we chose to include all movies with a positive coefficient in the Lf list, and all movies with a negative coefficient in the Lm list. For the fictive rating of a user A for a movie B, BlurMe suggests using either the average rating for movie B or the predicted rating for user A for movie B. Since [11] reports that there is almost no difference between these approaches, we chose to set the extra rating for a movie according to its respective overall average rating. This average is rounded, because only integer ratings are valid.

The authors of BlurMe take the following attack protocol into account: a gender inference model is trained on real, non-obfuscated data and tested on the obfuscated data. For this reason, the gender inference model is trained on unaltered data and tested on obfuscated data. They use 10-fold cross-validation and report the average classification accuracy of the model.

We report the results achieved by our BlurMe reproduction in Table 5. The reproduction is generally congruent with the original: the difference is negligible, and we can see that the classification accuracy decreases as the obfuscation increases.

Table 5: Performance of BlurMe's and the reproduction's obfuscation algorithm, measured by classification accuracy.

                                        Extra ratings
                          Strategy  0%     1%     5%     10%
BlurMe                    Random    0.802  0.776  0.715  0.611
                          Sampled   0.802  0.752  0.586  0.355
                          Greedy    0.802  0.577  0.173  0.025
Reproduction              Random    0.76   0.74   0.69   0.62
                          Sampled   0.76   0.71   0.58   0.33
                          Greedy    0.76   0.54   0.15   0.02
Reproduction, Normalized  Random    0.81   0.80   0.78   0.76
                          Sampled   0.81   0.80   0.78   0.75
                          Greedy    0.81   0.78   0.74   0.70

6 DISCUSSION & CONCLUSION
In conclusion, this work points to a weakness in a state-of-the-art gender obfuscation algorithm, BlurMe [11], and presents an improved algorithm, BlurM(or)e, that addresses the issue. BlurM(or)e is shown to be able to obfuscate gender in the user-item matrix without a substantial increase in RMSE. In other words, it keeps the utility of the data set intact. This work has shed light on some of the challenges of gender obfuscation.

We finish with a discussion of points from [11] that should be taken into account in future research. As mentioned before, normalization of the data set can have an enormous impact on classification performance. In Table 5, we see that when our reproduction incorporates normalization, the accuracy of gender inference still decreases with increasing obfuscation, but at a much slower rate. In addition, BlurMe used the ROC area under the curve metric for the first gender inference experiments, yet changed to classification accuracy for the gender inference on the obfuscated data set. Using accuracy as a performance metric on imbalanced data sets is a practice that should be avoided. It is advised to report the ROC AUC, the precision-recall AUC, and the ROC AUC on skew-normalized data when dealing with imbalanced data sets [3]. Finally, BlurMe declares (in [11]) the classification accuracy of 2.5% a success. One can argue that gender is only truly obfuscated if an attacking model achieves the same performance as a random classifier (i.e., exactly 50% accuracy, in the case of binary classification). This point should be taken into account in deciding the operational settings for BlurMe or BlurM(or)e. The decision also needs to consider the ease with which it is possible to detect whether a user's data has been obfuscated. Future work will study possibilities for obfuscating obfuscation.

REFERENCES
[1] Shlomo Berkovsky, Yaniv Eytani, Tsvi Kuflik, and Francesco Ricci. 2007. Enhancing Privacy and Preserving Accuracy of a Distributed Collaborative Filtering. In Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys '07). ACM, 9–16.
[2] Shlomo Berkovsky, Tsvi Kuflik, and Francesco Ricci. 2012. The Impact of Data Obfuscation on the Accuracy of Collaborative Filtering. Expert Systems with Applications 39, 5 (2012), 5033–5042.
[3] László A. Jeni, Jeffrey F. Cohn, and Fernando De La Torre. 2013. Facing Imbalanced Data—Recommendations for the Use of Performance Metrics. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, 245–251.
[4] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
[5] Dongsheng Li, Qin Lv, Li Shang, and Ning Gu. 2017. Efficient Privacy-Preserving Content Recommendation for Online Social Communities. Neurocomputing 219 (2017), 440–454.
[6] Yongsheng Liu, Hong Qu, Wenyu Chen, and S. M. Hasan Mahmud. 2019. An Efficient Deep Learning Model to Infer User Demographic Information From Ratings. IEEE Access 7 (2019), 53125–53135.
[7] Roger McNamee and Sandy Parakilas. 2018. The Facebook breach makes it clear: data must be regulated. The Guardian. https://www.theguardian.com/commentisfree/2018/mar/19/facebook-data-cambridge-analytica-privacy-breach. Online; accessed 05-July-2019.
[8] Rupa Parameswaran and Douglas M. Blough. 2007. Privacy Preserving Collaborative Filtering Using Data Obfuscation. In 2007 IEEE International Conference on Granular Computing (GRC '07). IEEE, 380–380.
[9] Sravana Reddy and Kevin Knight. 2016. Obfuscating Gender in Social Media Writing. In Proceedings of the 2016 EMNLP Workshop on NLP and Computational Social Science. ACL, 17–26.
[10] Vicenç Torra. 2017. Data Privacy: Foundations, New Developments and the Big Data Challenge. Springer International Publishing, Cham, 191–238.
[11] Udi Weinsberg, Smriti Bhagat, Stratis Ioannidis, and Nina Taft. 2012. BlurMe: Inferring and Obfuscating User Gender Based on Ratings. In Proceedings of the 2012 ACM Conference on Recommender Systems (RecSys '12). ACM, 195–202.