=Paper= {{Paper |id=Vol-2440/short2 |storemode=property |title= BlurM(or)e: Revisiting Gender Obfuscation in the User-Item Matrix |pdfUrl=https://ceur-ws.org/Vol-2440/short2.pdf |volume=Vol-2440 |authors=Christopher Strucks, Manel Slokom, Martha Larson |dblpUrl=https://dblp.org/rec/conf/recsys/StrucksSL19 }} == BlurM(or)e: Revisiting Gender Obfuscation in the User-Item Matrix== https://ceur-ws.org/Vol-2440/short2.pdf
BlurM(or)e: Revisiting Gender Obfuscation in the User-Item Matrix∗

Christopher Strucks, Radboud University, Netherlands, chr.strucks@gmail.com
Manel Slokom, TU Delft, Netherlands, m.slokom@tudelft.nl
Martha Larson, Radboud University and TU Delft, Netherlands, m.larson@cs.ru.nl

∗ Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the RMSE workshop held in conjunction with the 13th ACM Conference on Recommender Systems (RecSys), 2019, in Copenhagen, Denmark.

ABSTRACT
Past research has demonstrated that removing implicit gender information from the user-item matrix does not result in substantial performance losses. Such results point towards promising solutions for protecting users' privacy without compromising prediction performance, which are of particular interest in multistakeholder environments. Here, we investigate BlurMe, a gender obfuscation technique that has been shown to block classifiers from inferring binary gender from users' profiles. We first point out a serious shortcoming of BlurMe: simple data visualizations can reveal that BlurMe has been applied to a data set, including which items have been impacted. We then propose an extension to BlurMe, called BlurM(or)e, that addresses this issue. We reproduce the original BlurMe experiments with the MovieLens data set, and point out the relative advantages of BlurM(or)e.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Recommender Systems, Privacy, Data Obfuscation

1 INTRODUCTION
When users rate, or otherwise interact with, items, they may be aware that they are providing a recommender system with preference information. It is less likely, however, that users know that interaction information can implicitly hold sensitive personal information. In this paper, we focus on the problem of binary gender information in the user-item matrix, which can be inferred using a gender classifier. The state of the art in gender obfuscation for recommender system data is, to our knowledge, represented by Weinsberg et al. [11], who propose a gender obfuscation approach for a user-item matrix of movie ratings, called BlurMe. Successful obfuscation means that a user's gender cannot be correctly inferred by a classifier that has been previously trained on other users' rating data. BlurMe accomplishes this obfuscation without a substantial impact on the prediction performance of the recommender system that is trained on the obfuscated data. Our study of BlurMe has revealed that it has a serious shortcoming. In this paper, we discuss this issue, and propose an extension to BlurMe, called BlurM(or)e, that addresses it. We test BlurM(or)e against a reimplementation of BlurMe, reproducing experiments from [11].

Obfuscation is an important tool for maintaining user privacy, alongside other tools such as encryption. Obfuscation is widely studied in other areas, but has not received a great amount of attention in the area of recommender systems; exceptions are [2, 8]. Obfuscation can be added to the user-item matrix by users themselves, freeing them from an absolute dependency on the service provider to secure their data and use it properly. In [1, 2], the user can decide what data to reveal and how much protection is put on the data. Even trusted service providers can have issues, such as breaches, or data being acquired and used inappropriately [7].

The main contributions of this paper are:
• A discussion of a flaw we discovered in BlurMe.
• An extension to BlurMe, called BlurM(or)e, that addresses this issue.
• A set of experiments, whose results demonstrate the ability of BlurM(or)e to obfuscate binary gender in the user-item matrix with minimal impact on recommendation performance.

The paper is organized as follows. In Section 2, we cover the related work, before going on to present the shortcoming of BlurMe and our proposed improvement, BlurM(or)e, in Section 3. Next, we present our experiments and results in Section 4, and in Section 5 we discuss our reproduction of BlurMe.¹ We finish in Section 6 with a discussion and conclusion.

¹ The code for the reproduction, as well as for BlurM(or)e and the exploratory analysis that we carried out, is available at https://github.com/STrucks/BlurMore

2 BACKGROUND AND RELATED WORK
In this section, we discuss the work most closely related to our own.

2.1 Obfuscating the User-Item matrix
In order to protect user demographic information in the user-item matrix, researchers have suggested data obfuscation. Data obfuscation (a.k.a. data masking) describes the process of hiding the original, possibly sensitive data with modified or even fictional data [10]. The goal is to protect the privacy of users, while maintaining the utility of the data. Data obfuscation can be done in several ways: e.g., [9] used lexical substitution as an obfuscation mechanism for text, and [5] used user groups instead of individual users to hide personal information from the recommender system. In BlurMe [11], the authors found that it is possible to infer the gender of users from their rating histories via basic machine learning classifiers. They proposed an algorithm, BlurMe, which successfully obfuscates the gender of a user, thereby blocking gender inference. BlurMe basically adds ratings to every user profile that are typical for the opposite gender, and is currently the state of the art. The best-performing BlurMe obfuscation strategy, the greedy strategy, decreases the


accuracy of a logistic regression inference model from 80.2% on
the original data to 2.5% on the obfuscated data (adding 10% extra
ratings). The other proposed strategies have a smaller impact on
the classification accuracy. For this reason, in this work, we focus
on, and extend, the greedy strategy. Further details about the gender inference and obfuscation process can be found in Section 5.
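To make the greedy strategy concrete, here is a minimal sketch (function and variable names are ours; the original BlurMe code was not released): ratings for the most gender-indicative movies of the opposite gender are added first, using each movie's rounded average rating as the fictive rating.

```python
def blurme_greedy(profile, opposite_list, avg_rating, k=0.10):
    """Sketch of BlurMe's greedy strategy (names are ours, not the authors').

    profile       : dict movie_id -> rating (the user's real ratings)
    opposite_list : movie ids ranked by correlation with the opposite
                    gender, most indicative first
    avg_rating    : dict movie_id -> rounded overall average rating
    k             : fraction of extra ratings relative to profile size
    """
    obfuscated = dict(profile)
    n_extra = round(k * len(profile))   # e.g. 100 ratings, k=0.05 -> 5 extra
    for movie in opposite_list:
        if n_extra == 0:
            break
        if movie not in obfuscated:     # only add movies the user has not rated
            obfuscated[movie] = avg_rating[movie]
            n_extra -= 1
    return obfuscated
```

At the 10% operating point, for example, a male user with 100 ratings would receive 10 fictive ratings drawn from the top of the female list.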

2.2    Inference on the User-Item matrix
The goal of BlurMe obfuscation is to protect against gender infer-
ence. The BlurMe [11] authors use basic machine learning models
that can successfully infer users’ gender from the user-item matrix.
The most recent work on inference on the user-item matrix is, to our knowledge, that of [6], who developed a deep retentive learning framework that beats conventional, standard machine learning approaches in the task of inferring user demographic information. For gender inference, [6] achieves a classification accuracy of 82%. However, this is only 2% better than the standard logistic regression model used in [11]. We adopt the model from [11] here, since it is sufficiently close to the state of the art for our purposes.

[Figure 1: #ratings per movie for the movies 15 to 35. The red bar indicates an example of obvious data obfuscation after BlurMe is applied. The BlurMe data was created with the greedy strategy and with 10% extra ratings. The BlurM(or)e data also contains 10% extra ratings.]

3 BUILDING A BETTER BLURME
3.1 The Issue with BlurMe
BlurMe [11] proposes a powerful algorithm that can obfuscate the gender of a user. However, BlurMe has an important flaw: if the rating frequency of the movies is visualized, it is possible to determine that BlurMe has been applied to the data set, and to identify the movies for which ratings have been added. In Figure 1(A), the rating frequency is shown for 20 items from the MovieLens data set before obfuscation. In Figure 1(B), the rating frequency is shown for the same 20 items after obfuscation with BlurMe. BlurMe exhibits sharp spikes for certain items; here, it is item ID 27 (the movie Persuasion), which is marked in red. These spikes indicate that BlurMe has been applied, and point to the movies for which ratings have been added. There are two dangers associated with these spikes. First, if BlurMe is running at an operating point of 10% extra ratings using the greedy strategy, as mentioned above, then the gender inference accuracy is 2.5%. This means that if it is known that BlurMe has been applied, it is simple to reverse the decision of the classifier, and gender can be inferred with an accuracy of 97.5%. Second, even if we do not know the operating point of BlurMe (<10% extra ratings will not guarantee a gender classification accuracy that we can reverse), we can still find the spikes in the rating histogram and attempt to reverse BlurMe. In order to find a BlurMe spike, we would look for movies that are known not to be particularly popular, but that still have a lot of ratings in the BlurMe data. In this paper, we focus on addressing the first danger, and leave the second to future work.

3.2 The Definition of BlurM(or)e
BlurM(or)e was inspired by an exploratory analysis that we carried out, which revealed that a large number of movies are indicative of a gender. For this reason, it is not necessary to restrict the algorithm to adding ratings only to the most correlated movies (as the greedy strategy of BlurMe does). This means that we can mask the data without heavily relying on a small set of movies indicative of gender. Based on these insights, we designed BlurM(or)e, which works as follows: we create, just like BlurMe, two lists of movies, Lf and Lm, that correlate most strongly with females and males respectively. After that, we alter every user profile by adding movies from the opposite gender list with the greedy strategy proposed in BlurMe [11]. However, once a movie has doubled its initial rating count, it is removed from the list. (We use ×2, i.e., doubling, in this paper because it works well, and leave exploration of other possible values to future work.) Also, we keep track of the number of added ratings, so that we can remove the same number later on. After every user has received extra ratings up to a fixed percentage of their original ratings, we remove ratings from users that have rated a lot of movies (here we choose ≥ 200 movies, although future work could investigate other values). The idea is that these users already provide enough data for the gender classifier, so removing some of their ratings would not impact the classifier. This idea is also inspired by our exploratory analysis, which revealed that the gender classifier does not benefit from additional data once a user has already provided 200 ratings. This removal would be more difficult to diagnose in the user-item matrix, since exact information about users' rating counts would need to be available.

4 EXPERIMENTS AND MAIN RESULTS
4.1 Data
This study uses the publicly available MovieLens data set.² We chose MovieLens 1M, which is also used by BlurMe [11], whose work we are reproducing and extending. MovieLens 1M contains 3.7K movies and about 1M ratings from 6K different users, and also information on binary user gender. It is important to note that the gender distribution in the data set is unbalanced: there are 4331 males, who produced 750K ratings, and 1709 females, who produced 250K ratings. Statistics of the original and the obfuscated data sets are summarized in Table 1. We note that the number of items decreases for the BlurM(or)e data sets, due to the fact that the algorithm might remove all ratings of a certain movie by accident.

² https://grouplens.org/datasets/movielens/
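The BlurM(or)e procedure of Section 3.2 can be sketched roughly as follows (a simplified sketch with our own names and data layout, not the released implementation): greedy additions with a per-movie doubling cap, followed by random removals from heavy raters.

```python
import random

def blurmore(profiles, genders, female_list, male_list, avg_rating,
             initial_counts, k=0.10, heavy=200, seed=0):
    """Rough sketch of BlurM(or)e (names are ours): greedy additions with a
    per-movie cap (a movie leaves the candidate list once its rating count
    has doubled), then removal of the same number of ratings, drawn at
    random from users with at least `heavy` ratings."""
    rng = random.Random(seed)
    counts = dict(initial_counts)                  # current #ratings per movie
    cap = {m: 2 * c for m, c in initial_counts.items()}
    added = 0
    for user, profile in profiles.items():
        opposite = female_list if genders[user] == 'M' else male_list
        budget = round(k * len(profile))           # extra ratings for this user
        for movie in opposite:                     # most indicative first
            if budget == 0:
                break
            if movie in profile or counts.get(movie, 0) >= cap.get(movie, 0):
                continue                           # already rated, or capped
            profile[movie] = avg_rating[movie]     # fictive rating
            counts[movie] = counts.get(movie, 0) + 1
            added += 1
            budget -= 1
    # Removal pass: delete as many ratings as were added, at random,
    # from heavy raters only.
    pool = [(u, m) for u, p in profiles.items() if len(p) >= heavy for m in p]
    for user, movie in rng.sample(pool, min(added, len(pool))):
        del profiles[user][movie]
    return profiles
```

Because the number of added and removed ratings match, the total rating count stays close to the original, as reflected in Table 1.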

                                Table 1: Statistics of the data sets used in our experiments and analysis.

                              data set                #Users   #Items   #Ratings    Range   Av. rating   Density (%)   Variance
                            MovieLens 1M               6040     3706    1,000,209   [1,5]      3.58         4.47         1.25
                       BlurMe 1% extra ratings         6040     3706    1,013,416   [1,5]      3.58         4.53         1.25
                       BlurMe 5% extra ratings         6040     3706    1,052,886   [1,5]      3.58         4.70         1.20
                       BlurMe 10% extra ratings        6040     3706    1,099,545   [1,5]      3.57         4.91         1.16
                     BlurM(or)e 1% extra ratings       6040     3705    1,000,797   [1,5]      3.57         4.47         1.24
                     BlurM(or)e 5% extra ratings       6040     3700    1,000,773   [1,5]      3.55         4.48         1.22
                     BlurM(or)e 10% extra ratings      6040     3699    1,000,395   [1,5]      3.57         4.48         1.16
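For reference, the density column in Table 1 is the percentage of filled cells in the user-item matrix; a small hypothetical helper illustrates the computation:

```python
def density_percent(n_ratings, n_users, n_items):
    """Percentage of non-empty cells in the user-item matrix."""
    return 100.0 * n_ratings / (n_users * n_items)

# MovieLens 1M: 1,000,209 ratings over 6040 users and 3706 items
# gives a density of roughly 4.47%.
```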


4.2 Comparison of BlurMe and BlurM(or)e
We compare the performance of our new obfuscation mechanism, BlurM(or)e, with the original obfuscation mechanism, BlurMe. The performance is measured, in line with the experiments in BlurMe, by the classification accuracy of a logistic regression model that is trained on unaltered data and tested on obfuscated data. The performance is cross-validated using 10-fold cross-validation. Table 2 shows that BlurM(or)e performs similarly to BlurMe: the more obfuscation is applied to the data set, the lower the classification accuracy. Note that Table 2 contains the reproduction of BlurMe that is discussed in detail in Section 5.

                                                Extra ratings
         Classifier            Data set      0%     1%     5%    10%
      Logistic Regression      BlurMe       0.76   0.54   0.15   0.02
      Logistic Regression     BlurM(or)e    0.76   0.64   0.36   0.19
      Random Classifier        Original     0.50   0.50   0.50   0.50

Table 2: Gender inference results measured in accuracy on BlurMe (reproduction) and BlurM(or)e.

A big advantage of BlurM(or)e is that an attacker cannot easily see that the data set is obfuscated. Figure 1 shows the number of ratings per movie for 20 different movies in the MovieLens 1M data set. The red bar corresponds to the number of ratings for the movie with ID 27. After the BlurMe obfuscation is applied, the red bar spans approximately ten times its original size. This makes an attacker suspicious and indicates that the data set is obfuscated. However, if the BlurM(or)e obfuscation is applied, the red bar only doubles in size, which is less noticeable. Also, BlurM(or)e has statistics that are more similar to the original data. Table 1 shows that BlurM(or)e keeps the number of interactions as well as the density similar to the original MovieLens data set, while BlurMe produces a denser data set with more interactions.

The reduction part of BlurM(or)e has a less noticeable effect on the data set. Since the ratings are removed randomly from users with an extreme number of ratings, the distribution of the number of ratings per movie does not change dramatically (the bar with ID 20 shrinks by ≈ 10% of its original size in the BlurM(or)e data set).

4.3 Recommendation Performance
Using a well-known collaborative filtering technique, Matrix Factorization [4], similar to BlurMe, we notice that the change in RMSE is not substantial. The change has a maximum of 0.0298 for MovieLens with BlurM(or)e and 0.0381 for BlurMe (with the greedy strategy and 10% extra ratings). We can see in Table 3 that the RMSE decreases as obfuscation increases. BlurMe [11] discovered the same effect and explained that it might be due to the density of the obfuscated data. Since BlurM(or)e does not increase the overall density of the data, an alternative explanation can be found: the reason lies, perhaps, in increasing the density of users with few ratings.

                            Extra ratings
      Obfuscation      0%      1%      5%      10%
        Original     0.8766    —       —       —
         BlurMe      0.8766  0.8686  0.8553  0.8385
       BlurM(or)e    0.8766  0.8711  0.8640  0.8468

Table 3: The RMSE performance with Matrix Factorization on the original data, BlurMe data, and BlurM(or)e data.

5 BLURME REPRODUCTION IN DETAIL
Since we did not have the code of the original BlurMe [11], we reimplemented it in order to carry out the comparison in this paper. Because the paper was not specific about the settings of all parameters, it is not possible to create an exact replication. For completeness, we discuss our reimplementation here, so that authors building on our work have the complete details.

5.1 Gender Inference
This section describes our reimplementation of the gender inference models. We create the user-item matrix by associating every user with a vector of ratings x_i, with i being the index of the user and x_{i,j} being the rating of user i for movie j. If the movie is not rated, we set x_{i,j} = 0. This results in a U × I matrix, where U is the number of users and I is the number of items. Every user vector is associated with a gender, which serves as the target label for the classifier.

Following the experiments of [11], all classifiers are trained and tested on this user-item matrix with 10-fold cross-validation. We do not have information about the splits that were used, so we use our own splits. The ROC area under the curve, as well as precision and recall, are reported as performance measures. A comparison of the results can be seen in Table 4. The SVM uses a linear kernel and a C value of 1. For the Bernoulli classifier, the user-item matrix is transformed so that every rating x_{i,j} greater than 0 is set to 1. This means that the Bernoulli Bayes classifier ignores the value of the rating and only uses information about whether a user i rated movie j or not. All remaining parameters for the other classifiers are set to the default values.

There is about a 4% difference between the scores reported in the original BlurMe paper [11] and those we measured with our


reproduction. Further exploration revealed that normalization, in terms of scaling all ratings from values in [0, 5] to values in [0, 1], can have a large impact on scores. We do not focus further on normalization here, but point out its impact because it suggests that there are parameters that could have been adjusted that are not explicitly recorded in [11]. In this paper, we have chosen to focus on the logistic regression model, since it is the fastest and achieves the best performance.

                            BlurMe results    Reproduction results
          Classifier        AUC     P/R        AUC       P/R
           Bernoulli        0.81  0.79/0.76    0.77   0.88/0.48
         Multinomial        0.84  0.80/0.76    0.81   0.89/0.77
             SVM            0.86  0.78/0.77    0.79   0.83/0.82
      Logistic Regression   0.85  0.80/0.80    0.81   0.84/0.83

Table 4: Gender inference results for both BlurMe and the reproduction thereof. The performance is measured in ROC AUC, precision, and recall.

Note that Table 4 uses ROC AUC as the performance metric, while Table 2 uses classification accuracy. This choice was made by BlurMe, and for the sake of comparing the models, we did the same.

5.2 Gender Obfuscation
This section describes our reimplementation of the obfuscation approach of BlurMe [11]. Recall that the basic idea of BlurMe is to add fictional ratings to every user that are atypical for their gender. BlurMe [11] creates two lists, Lf and Lm, of atypical movies for each gender by training and cross-validating a logistic regression model on the training set. The movies in Lf and Lm are ranked according to their average rank across the folds. The rank of a movie within a fold is determined by the coefficient that is learned for it by the logistic regression model. The lists Lf and Lm also include the average coefficient over all folds for each movie, which serves as a correlation metric between the movie and the user's gender.

After these lists are created, BlurMe takes every user profile and adds k fictive ratings to the profile for movies from the opposite gender list. The parameter k limits the number of extra ratings and is set to 1%, 5%, or 10% in the original experiments. A male user with 100 ratings in the original data set would be obfuscated by adding 5 (for k = 5%) fictive ratings from the female list.

There are some design choices left: which movies should be selected from the lists, and what should the fictive rating be? The authors of BlurMe [11] proposed three different selection strategies for the first problem: the Random Strategy, the Sampled Strategy, and the Greedy Strategy. The Random Strategy chooses k movies uniformly at random from the list; the Sampled Strategy chooses k movies randomly, but in line with the score distribution of the movies, so that a movie with a high coefficient is more likely to be added. Finally, the Greedy Strategy chooses the movie with the highest score. The authors do not mention the length of the lists; thus, we chose to include all movies with a positive coefficient in the Lf list, and all movies with a negative coefficient in the Lm list.

For the fictive rating of a user A for a movie B, BlurMe suggests using either the average rating for movie B or the predicted rating of user A for movie B. Since [11] reports that there is almost no difference between these approaches, we chose to set the extra rating for a movie according to its overall average rating. This average is rounded, because only integer ratings are valid.

The authors of BlurMe take the following attack protocol into account: a gender inference model is trained on real, non-obfuscated data and tested on the obfuscated data. Accordingly, our gender inference model is also trained on unaltered data and tested on obfuscated data. They use 10-fold cross-validation and report the average classification accuracy of the model.

We report the results achieved by our BlurMe reproduction in Table 5. The reproduction is generally congruent with the original; the differences are negligible, and we can see that the classification accuracy decreases as the obfuscation increases.

                                                  Extra ratings
                               Strategy     0%      1%      5%      10%
                                Random     0.802   0.776   0.715   0.611
           BlurMe              Sampled     0.802   0.752   0.586   0.355
                                Greedy     0.802   0.577   0.173   0.025
                                Random     0.76    0.74    0.69    0.62
        Reproduction           Sampled     0.76    0.71    0.58    0.33
                                Greedy     0.76    0.54    0.15    0.02
                                Random     0.81    0.80    0.78    0.76
  Reproduction, Normalized     Sampled     0.81    0.80    0.78    0.75
                                Greedy     0.81    0.78    0.74    0.70

Table 5: Performance of BlurMe's and the reproduction's obfuscation algorithm, measured by classification accuracy.

6 DISCUSSION & CONCLUSION
In conclusion, this work points to a weakness in a state-of-the-art gender obfuscation algorithm, BlurMe [11], and presents an improved algorithm, BlurM(or)e, that addresses the issue. BlurM(or)e is shown to be able to obfuscate gender in the user-item matrix without a substantial increase in RMSE. In other words, it keeps the utility of the data set intact. This work has shed light on some of the challenges of gender obfuscation.

We finish with a discussion of points from [11] that should be taken into account in future research. As mentioned before, normalization of the data set can have an enormous impact on classification performance. In Table 5, we see that when our reproduction incorporates normalization, the accuracy of gender inference still decreases with increasing obfuscation, but at a much slower rate.

In addition, BlurMe used the ROC area under the curve metric for the first gender inference experiments, yet changed to classification accuracy for the gender inference on the obfuscated data set. Using accuracy as a performance metric on imbalanced data sets is a practice that should be avoided. It is advised to report the ROC AUC, the precision-recall AUC, and the ROC AUC on skew-normalized data when dealing with imbalanced data sets [3].

Finally, BlurMe [11] declares a classification accuracy of 2.5% to be a success. One can argue that gender is only truly obfuscated if an attacking model achieves the same performance as a random classifier (i.e., exactly 50% accuracy, in the case of binary classification). This point should be taken into account in deciding the operational settings for BlurMe or BlurM(or)e. The decision also needs to consider the ease with which it is possible to detect whether a user's data has been obfuscated. Future work will study possibilities for obfuscating obfuscation.
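The attack protocol described in Section 5.2 (train on unaltered data, test on obfuscated data, average accuracy over folds) can be sketched with scikit-learn as follows; the function name and the use of stratified folds are our assumptions, since the paper's exact splits are unknown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def attack_accuracy(X_orig, X_obf, y, n_splits=10, seed=0):
    """Attack-protocol sketch: train a gender classifier on rows of the
    original user-item matrix and score it on the corresponding rows of
    the obfuscated matrix, averaging accuracy over stratified folds."""
    accs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in skf.split(X_orig, y):
        clf = LogisticRegression(max_iter=1000).fit(X_orig[train], y[train])
        accs.append(clf.score(X_obf[test], y[test]))  # test on obfuscated rows
    return float(np.mean(accs))
```

Note that a confidently wrong classifier can simply be reversed: an attack accuracy of 2.5% corresponds to 97.5% after flipping the decision, which is why accuracies far below 50% do not constitute successful obfuscation.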


REFERENCES
[1] Shlomo Berkovsky, Yaniv Eytani, Tsvi Kuflik, and Francesco Ricci. 2007. Enhancing Privacy and Preserving Accuracy of a Distributed Collaborative Filtering. In Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys '07). ACM, 9–16.
[2] Shlomo Berkovsky, Tsvi Kuflik, and Francesco Ricci. 2012. The Impact of Data Obfuscation on the Accuracy of Collaborative Filtering. Expert Systems with Applications 39, 5 (2012), 5033–5042.
[3] László A. Jeni, Jeffrey F. Cohn, and Fernando De La Torre. 2013. Facing Imbalanced Data—Recommendations for the Use of Performance Metrics. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, 245–251.
[4] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. IEEE Computer 42, 8 (2009), 30–37.
[5] Dongsheng Li, Qin Lv, Li Shang, and Ning Gu. 2017. Efficient Privacy-Preserving Content Recommendation for Online Social Communities. Neurocomputing 219 (2017), 440–454.
[6] Yongsheng Liu, Hong Qu, Wenyu Chen, and S. M. Hasan Mahmud. 2019. An Efficient Deep Learning Model to Infer User Demographic Information From Ratings. IEEE Access 7 (2019), 53125–53135.
[7] Roger McNamee and Sandy Parakilas. 2018. The Facebook Breach Makes It Clear: Data Must Be Regulated. The Guardian. https://www.theguardian.com/commentisfree/2018/mar/19/facebook-data-cambridge-analytica-privacy-breach. Online; accessed 05-July-2019.
[8] Rupa Parameswaran and Douglas M. Blough. 2007. Privacy Preserving Collaborative Filtering Using Data Obfuscation. In 2007 IEEE International Conference on Granular Computing (GRC '07). IEEE, 380–380.
[9] Sravana Reddy and Kevin Knight. 2016. Obfuscating Gender in Social Media Writing. In Proceedings of the 2016 EMNLP Workshop on NLP and Computational Social Science. ACL, 17–26.
[10] Vicenç Torra. 2017. Data Privacy: Foundations, New Developments and the Big Data Challenge. Springer International Publishing, Cham, 191–238.
[11] Udi Weinsberg, Smriti Bhagat, Stratis Ioannidis, and Nina Taft. 2012. BlurMe: Inferring and Obfuscating User Gender Based on Ratings. In Proceedings of the 2012 ACM Conference on Recommender Systems (RecSys '12). ACM, 195–202.