BlurM(or)e: Revisiting Gender Obfuscation in the User-Item Matrix∗

Christopher Strucks, Radboud University, Netherlands, chr.strucks@gmail.com
Manel Slokom, TU Delft, Netherlands, m.slokom@tudelft.nl
Martha Larson, Radboud University and TU Delft, Netherlands, m.larson@cs.ru.nl

RMSE workshop at RecSys'19, September 2019, Copenhagen, Denmark

ABSTRACT
Past research has demonstrated that removing implicit gender information from the user-item matrix does not result in substantial performance losses. Such results point towards promising solutions for protecting users' privacy without compromising prediction performance, which are of particular interest in multistakeholder environments. Here, we investigate BlurMe, a gender obfuscation technique that has been shown to block classifiers from inferring binary gender from users' profiles. We first point out a serious shortcoming of BlurMe: simple data visualizations can reveal that BlurMe has been applied to a data set, including which items have been impacted. We then propose an extension to BlurMe, called BlurM(or)e, that addresses this issue. We reproduce the original BlurMe experiments with the MovieLens data set, and point out the relative advantages of BlurM(or)e.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Recommender Systems, Privacy, Data Obfuscation

1 INTRODUCTION
When users rate, or otherwise interact with, items, they may be aware that they are providing a recommender system with preference information. It is less likely, however, that users know that interaction information can implicitly hold sensitive personal information. In this paper, we focus on the problem of binary gender information in the user-item matrix, which can be inferred by using a gender classifier. The state of the art in gender obfuscation for recommender system data is, to our knowledge, represented by Weinsberg et al. [11], who propose a gender obfuscation approach for a user-item matrix of movie ratings, called BlurMe. Successful obfuscation means that a user's gender cannot be correctly inferred by a classifier that has been previously trained on other users' rating data. BlurMe accomplishes this obfuscation without a substantial impact on the prediction performance of the recommender system that is trained on the obfuscated data. Our study of BlurMe has revealed that it has a serious shortcoming. In this paper, we discuss this issue, and propose an extension to BlurMe, called BlurM(or)e, that addresses it. We test BlurM(or)e against a reimplementation of BlurMe, reproducing experiments from [11].

Obfuscation is an important tool for maintaining user privacy, alongside other tools such as encryption. Obfuscation is widely studied in other areas, but does not receive a great amount of attention in the area of recommender systems; exceptions are [2, 8]. Obfuscation can be added to the user-item matrix by users themselves, freeing them from an absolute dependency on the service provider to secure their data and use it properly. In [1, 2], the user can decide what data to reveal and how much protection is put on the data. Even trusted service providers can have issues, such as breaches, or data being acquired and used inappropriately [7].

The main contributions of this paper are:
• A discussion of a flaw we discovered in BlurMe.
• An extension to BlurMe, called BlurM(or)e, that addresses this issue.
• A set of experiments, whose results demonstrate the ability of BlurM(or)e to obfuscate binary gender in the user-item matrix with minimal impact on recommendation performance.

The paper is organized as follows. In Section 2, we cover the related work, before going on to present the shortcoming of BlurMe and our proposed improvement, BlurM(or)e, in Section 3. Next, we present our experiments and results in Section 4, and in Section 5 we discuss our reproduction of BlurMe.¹ We finish in Section 6 with a discussion and conclusion.

2 BACKGROUND AND RELATED WORK
In this section, we discuss the work most closely related to our own.

2.1 Obfuscating the User-Item Matrix
In order to protect user demographic information in the user-item matrix, researchers have suggested data obfuscation. Data obfuscation (a.k.a. data masking) describes the process of hiding the original, possibly sensitive data with modified or even fictional data [10]. The goal is to protect the privacy of users, while maintaining the utility of the data. Data obfuscation can be done in several ways: e.g., [9] used lexical substitution as an obfuscation mechanism for text, and [5] used user groups instead of individual users to hide personal information from the recommender system. In BlurMe [11], the authors found that it is possible to infer the gender of users from their rating histories via basic machine learning classifiers. They proposed an algorithm, BlurMe, which successfully obfuscates the gender of a user, thereby blocking gender inference. BlurMe basically adds ratings to every user profile that are typical for the opposite gender, and is currently the state of the art. The best performing BlurMe obfuscation strategy, the greedy strategy, decreases the accuracy of a logistic regression inference model from 80.2% on the original data to 2.5% on the obfuscated data (adding 10% extra ratings).

∗ Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the RMSE workshop held in conjunction with the 13th ACM Conference on Recommender Systems (RecSys), 2019, in Copenhagen, Denmark.
¹ The code for the reproduction as well as for BlurM(or)e and the exploratory analysis that we carried out is available at https://github.com/STrucks/BlurMore
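For illustration, the greedy obfuscation idea described above can be sketched in a few lines of Python. This is a simplified sketch, not the authors' released code; the movie IDs, list ordering, and average ratings in the example are made-up toy values.

```python
# Minimal sketch of BlurMe-style greedy obfuscation (illustration only;
# the movie IDs, coefficients, and ratings below are toy values).

def greedy_obfuscate(profile, opposite_list, extra_pct, avg_rating):
    """Add extra ratings (extra_pct of profile size) for the most
    gender-indicative unrated movies from the opposite-gender list.

    profile:       dict movie_id -> rating (the user's real ratings)
    opposite_list: movie ids ordered most-indicative-first; the greedy
                   strategy always picks from the top of this list
    avg_rating:    dict movie_id -> rounded average rating, used as the
                   fictive rating value
    """
    k = max(1, round(len(profile) * extra_pct))   # number of ratings to add
    obfuscated = dict(profile)
    for movie in opposite_list:
        if k == 0:
            break
        if movie not in obfuscated:               # only add unrated movies
            obfuscated[movie] = avg_rating[movie]
            k -= 1
    return obfuscated

# Toy example: a user with 10 ratings, obfuscated at 10% extra ratings.
profile = {m: 4 for m in range(10)}
opposite = [42, 7, 99]        # movie 7 is already rated, so it is skipped
averages = {42: 4, 7: 3, 99: 5}
result = greedy_obfuscate(profile, opposite, 0.10, averages)
print(sorted(set(result) - set(profile)))  # -> [42]
```

In the real algorithm, the opposite-gender list is derived from logistic regression coefficients and the fictive rating is the rounded item average; those details are covered in Section 5.2.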
The other proposed strategies have a smaller impact on the classification accuracy. For this reason, in this work, we focus on, and extend, the greedy strategy. Details and further explanation of the gender inference and obfuscation process can be found in Section 5.

2.2 Inference on the User-Item Matrix
The goal of BlurMe obfuscation is to protect against gender inference. The BlurMe [11] authors use basic machine learning models that can successfully infer users' gender from the user-item matrix. The most recent work on inference on the user-item matrix is, to our knowledge, that of [6], who developed a deep retentive learning framework that beats the conventional, standard machine learning approaches in the task of inferring user demographic information. For gender inference, [6] achieves a classification accuracy of 82%. However, this is only 2% better than the standard logistic regression model used in [11]. We adopt the model from [11] here, since it is sufficiently close to the state of the art for our purposes.

Figure 1: #ratings per movie for the movies 15 to 35. The red bar indicates an example of obvious data obfuscation after BlurMe is applied. The BlurMe data was created with the greedy strategy and with 10% extra ratings. The BlurM(or)e data also contains 10% extra ratings.

3 BUILDING A BETTER BLURME
3.1 The Issue with BlurMe
BlurMe [11] proposes a powerful algorithm that can obfuscate the gender of a user. However, BlurMe has an important flaw: if the rating frequencies of the movies are visualized, it is possible to determine that BlurMe has been applied to the data set, and to identify the movies for which ratings have been added. In Figure 1(A), the rating frequency is shown for 20 items from the MovieLens data set before obfuscation. In Figure 1(B), the rating frequency is shown for the same 20 items after obfuscation with BlurMe. BlurMe exhibits sharp spikes for certain items; here, it is item ID 27 (the movie Persuasion), which is marked in red. These spikes indicate that BlurMe has been applied, and point to the movies for which ratings have been added.

There are two dangers associated with these spikes. First, if BlurMe is running at an operating point of 10% extra ratings using the greedy strategy, as mentioned above, then the gender inference accuracy is 2.5%. This means that if it is known that BlurMe has been applied, it is simple to reverse the decision of the classifier, and gender can be known with an accuracy of 97.5%. Second, if we do not know the operating point of BlurMe (<10% extra ratings will not guarantee a gender classification accuracy that we can reverse), we can still find the spikes in the rating histogram, and attempt to reverse BlurMe. In order to find a BlurMe spike, we would look for movies that are known not to be particularly popular, but that still have a lot of ratings in the BlurMe data. In this paper, we focus on addressing the first danger, and leave the second to future work.

3.2 The Definition of BlurM(or)e
BlurM(or)e was inspired by an exploratory analysis that we carried out, which revealed that a large number of movies are indicative of a gender. For this reason, it is not necessary to restrict the algorithm to adding ratings only to the most correlated movies (like the greedy strategy of BlurMe does). This means that we can mask the data without heavily relying on a small set of movies indicative of gender.

Based on these insights, we designed BlurM(or)e, which works as follows: We create, just like BlurMe, two lists of movies, Lf and Lm, that correlate most strongly with females and males, respectively. After that, we alter every user profile by adding movies from the opposite-gender list with the greedy strategy proposed in BlurMe [11]. However, if a movie has already doubled its initial rating count, it is removed from the list. (We use ×2, i.e., doubling, in this paper because it works well, and leave the exploration of other possible values to future work.) Also, we keep track of the number of added ratings, so that we can remove the same number later on. After every user has received extra ratings up to a fixed percentage of their original ratings, we remove ratings from users that have rated a lot of movies (here we choose ⩾ 200 movies, although future work could investigate other values). The idea is that these users already provide enough data for the gender classifier, so removing some of their ratings would not impact the classifier. This idea is also supported by our exploratory analysis, which revealed that the gender classifier does not benefit from additional data once a user has already provided 200 ratings. This removal would be more difficult to diagnose in the user-item matrix, since exact information about users' rating rates would need to be available.

4 EXPERIMENTS AND MAIN RESULTS
4.1 Data
This study uses the publicly available MovieLens data set.² We chose MovieLens 1M, which is also used by BlurMe [11], whose work we are reproducing and extending. MovieLens 1M contains 3.7K movies and about 1M ratings by 6K different users, and also information on binary user gender. It is important to note that the distribution in the data set is unbalanced: there are 4331 males, who produced 750K ratings, and 1709 females, who produced 250K ratings. Statistics of the original and the obfuscated data sets are summarized in Table 1. We note that the number of items decreases for the BlurM(or)e data sets due to the fact that the algorithm might remove all ratings of a certain movie by accident.

² https://grouplens.org/datasets/movielens/

Table 1: Statistics of the data sets used in our experiments and analysis.
data set                      #Users  #Items  #Ratings   Range  Av.rating  Density(%)  Variance
MovieLens 1M                  6040    3706    1,000,209  [1,5]  3.58       4.47        1.25
BlurMe 1% extra ratings       6040    3706    1,013,416  [1,5]  3.58       4.53        1.25
BlurMe 5% extra ratings       6040    3706    1,052,886  [1,5]  3.58       4.70        1.20
BlurMe 10% extra ratings      6040    3706    1,099,545  [1,5]  3.57       4.91        1.16
BlurM(or)e 1% extra ratings   6040    3705    1,000,797  [1,5]  3.57       4.47        1.24
BlurM(or)e 5% extra ratings   6040    3700    1,000,773  [1,5]  3.55       4.48        1.22
BlurM(or)e 10% extra ratings  6040    3699    1,000,395  [1,5]  3.57       4.48        1.16

4.2 Comparison of BlurMe and BlurM(or)e
We compare the performance of our new obfuscation mechanism, BlurM(or)e, with the original obfuscation mechanism, BlurMe. The performance is measured, in line with the experiments in BlurMe, by the classification accuracy of a logistic regression model that is trained on unaltered data and tested on obfuscated data. The performance is cross-validated using 10-fold cross-validation. Table 2 shows that BlurM(or)e performs similarly to BlurMe. The more obfuscation is applied to the data set, the lower the classification accuracy is. Note that Table 2 contains the reproduction of BlurMe that is discussed in detail in Section 5.

Table 2: Gender inference results measured in accuracy on BlurMe (reproduction) and BlurM(or)e.

                                          Extra ratings
Classifier           Data set       0%    1%    5%    10%
Logistic Regression  BlurMe         0.76  0.54  0.15  0.02
Logistic Regression  BlurM(or)e     0.76  0.64  0.36  0.19
Random Classifier    Original       0.50  0.50  0.50  0.50

A big advantage of BlurM(or)e is that an attacker cannot easily see that the data set is obfuscated. Figure 1 shows the number of ratings per movie for 20 different movies in the MovieLens 1M data set. The red bar corresponds to the number of ratings for the movie with ID 27. After the BlurMe obfuscation is applied, the red bar spans approximately ten times its original size. This makes the attacker suspicious and indicates that the data set is obfuscated. However, if the BlurM(or)e obfuscation is applied, the red bar only doubles its size, which is less noticeable. Also, BlurM(or)e has statistics more similar to the original data. Table 1 shows that BlurM(or)e keeps the number of interactions as well as the density similar to the original MovieLens data set, while BlurMe produces a denser data set with more interactions.

The reduction part of BlurM(or)e has a less noticeable effect on the data set. Since the ratings are removed randomly from users with an extreme number of ratings, the distribution of the number of ratings per movie does not change dramatically (the bar with ID 20 shrinks by ≈ 10% of its original size in the BlurM(or)e data set).

4.3 Recommendation Performance
Using a well-known collaborative filtering technique, Matrix Factorization [4], as in BlurMe, we notice that the change in RMSE is not substantial. The change has a maximum of 0.0298 for MovieLens with BlurM(or)e and 0.0381 for BlurMe (with the greedy strategy and 10% extra ratings). We can see in Table 3 that the RMSE decreases as obfuscation increases. BlurMe [11] discovered the same effect and explained that this might be due to the density of the obfuscated data. Since BlurM(or)e does not increase the overall density of the data, an alternative explanation can be found. The reason lies, perhaps, in increasing the density of users with few ratings.

Table 3: The RMSE performance with Matrix Factorization on the original data, BlurMe data, and BlurM(or)e data.

                     Extra ratings
Obfuscation    0%      1%      5%      10%
Original       0.8766  —       —       —
BlurMe         0.8766  0.8686  0.8553  0.8385
BlurM(or)e     0.8766  0.8711  0.8640  0.8468

5 BLURME REPRODUCTION IN DETAIL
Since we did not have the code of the original BlurMe [11], we reimplemented it in order to carry out the comparison in this paper. Because the paper was not specific about the settings of all parameters, it is not possible to create an exact replication. For completeness, we discuss our reimplementation here, so that authors building on our work have the complete details.

5.1 Gender Inference
This section describes our reimplementation of the gender inference models. We create the user-item matrix by associating every user with a vector of ratings: x_i, with i being the index of the user and x_{i,j} being the rating of user i for movie j. If the movie is not rated, we set x_{i,j} = 0. This results in a U × I matrix, where U is the number of users and I is the number of items. Every user vector is associated with a gender, which serves as the target label for the classifier.

Following the experiments of [11], all classifiers are trained and tested on this user-item matrix with 10-fold cross-validation. We do not have information about the splits that were used, so we use our own splits. The ROC area under the curve as well as precision and recall are reported as performance measures. A comparison of the results can be seen in Table 4. The SVM uses a linear kernel and a C value of 1. For the Bernoulli classifier, the user-item matrix is transformed so that every rating x_{i,j} that is greater than 0 is set to 1. This means that the Bernoulli Bayes classifier ignores the value of the rating and only uses information about whether a user i rated movie j or not. All remaining parameters for the other classifiers are set to their default values.

There is about a 4% difference between the scores reported in the original BlurMe paper [11] and those we measured with our reproduction.
Further exploration revealed that normalization, in terms of scaling all ratings from values in [0, 5] to values in [0, 1], can have a large impact on scores. We do not focus on normalization further here, but point out its impact because it suggests that there are parameters that could have been adjusted that are not explicitly recorded in [11]. In this paper, we have chosen to focus on the logistic regression model, since it is the fastest and achieves the best performance.

Table 4: Gender inference results for both BlurMe and the reproduction thereof. The performance is measured in ROC AUC, precision, and recall.

                     BlurMe results     Reproduction results
Classifier           AUC   P/R          AUC   P/R
Bernoulli            0.81  0.79/0.76    0.77  0.88/0.48
Multinomial          0.84  0.80/0.76    0.81  0.89/0.77
SVM                  0.86  0.78/0.77    0.79  0.83/0.82
Logistic Regression  0.85  0.80/0.80    0.81  0.84/0.83

Note that Table 4 uses ROC AUC as the performance metric, while Table 2 uses classification accuracy. This choice was made by BlurMe, and for the sake of comparing the models, we did the same.

5.2 Gender Obfuscation
This section describes our reimplementation of the obfuscation approach of BlurMe [11]. Recall that the basic idea of BlurMe is to add fictional ratings to every user that are atypical for their gender. BlurMe [11] creates two lists, Lf and Lm, of atypical movies for each gender by training and cross-validating a logistic regression model on the training set. The movies in Lf and Lm are ranked according to their average rank across the folds. The rank of a movie within a fold is determined by its coefficient, which is learned by the logistic regression model. The lists Lf and Lm also include the average coefficient over all folds for each movie, which serves as a correlation metric between the movie and the user's gender.

After these lists are created, BlurMe takes every user profile and adds k fictive ratings to the profile for movies from the opposite-gender list. The parameter k limits the number of extra ratings and is set to 1%, 5%, or 10% in the original experiments. A male user with 100 ratings in the original data set would be obfuscated by adding 5 (for k = 5%) fictive ratings from the female list.

There are some design choices left: which movies should be selected from the lists, and what should the fictive rating be? The authors of BlurMe [11] proposed three different selection strategies for the first problem: the Random Strategy, the Sampled Strategy, and the Greedy Strategy. The Random Strategy chooses k movies uniformly at random from the list; the Sampled Strategy chooses k movies randomly, but in line with the score distribution of the movies, so that a movie with a high coefficient is more likely to be added. Finally, the Greedy Strategy chooses the movie with the highest score. The authors do not mention the length of the lists; thus, we chose to include all movies with a positive coefficient in the Lf list, and all movies with a negative coefficient in the Lm list. For the fictive rating of a user A for a movie B, BlurMe suggests using either the average rating for movie B or the predicted rating for user A for movie B. Since [11] reports that there is almost no difference between these approaches, we chose to set the extra rating for a movie according to its respective overall average rating. This average is rounded, because only integer ratings are valid.

The authors of BlurMe take the following attack protocol into account: a gender inference model is trained on real, non-obfuscated data and tested on the obfuscated data. For this reason, the gender inference model is trained on unaltered data and tested on obfuscated data. They use 10-fold cross-validation and report the average classification accuracy of the model.

We report the results achieved by our BlurMe reproduction in Table 5. The reproduction is generally congruent with the original: the difference is negligible, and we can see that the classification accuracy decreases as the obfuscation increases.

Table 5: Performance of BlurMe's and the reproduction's obfuscation algorithm, measured by classification accuracy.

                                        Extra ratings
                          Strategy  0%     1%     5%     10%
BlurMe                    Random    0.802  0.776  0.715  0.611
                          Sampled   0.802  0.752  0.586  0.355
                          Greedy    0.802  0.577  0.173  0.025
Reproduction              Random    0.76   0.74   0.69   0.62
                          Sampled   0.76   0.71   0.58   0.33
                          Greedy    0.76   0.54   0.15   0.02
Reproduction, Normalized  Random    0.81   0.80   0.78   0.76
                          Sampled   0.81   0.80   0.78   0.75
                          Greedy    0.81   0.78   0.74   0.70

6 DISCUSSION & CONCLUSION
In conclusion, this work points to a weakness in a state-of-the-art gender obfuscation algorithm, BlurMe [11], and presents an improved algorithm, BlurM(or)e, that addresses the issue. BlurM(or)e is shown to be able to obfuscate gender in the user-item matrix without a substantial increase in RMSE. In other words, it keeps the utility of the data set intact. This work has shed light on some of the challenges of gender obfuscation.

We finish with a discussion of points from [11] that should be taken into account in future research. As mentioned before, normalization of the data set can have an enormous impact on classification performance. In Table 5, we see that when our reproduction incorporates normalization, the accuracy of gender inference still decreases with increasing obfuscation, but at a much slower rate. In addition, BlurMe used the ROC area under the curve metric for the first gender inference experiments, yet changed to classification accuracy for the gender inference on the obfuscated data set. Using accuracy as a performance metric on imbalanced data sets is a practice that should be avoided. It is advised to report the ROC AUC, the precision-recall AUC, and the ROC AUC on skew-normalized data when dealing with imbalanced data sets [3]. Finally, BlurMe declares (in [11]) the classification accuracy of 2.5% a success. One can argue that gender is only truly obfuscated if an attacking model achieves the same performance as a random classifier (i.e., exactly 50% accuracy, in the case of binary classification). This point should be taken into account in deciding the operational settings for BlurMe or BlurM(or)e. The decision also needs to consider the ease with which it is possible to detect whether a user's data has been obfuscated. Future work will study possibilities for obfuscating obfuscation.

REFERENCES
[1] Shlomo Berkovsky, Yaniv Eytani, Tsvi Kuflik, and Francesco Ricci. 2007. Enhancing Privacy and Preserving Accuracy of a Distributed Collaborative Filtering. In Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys '07). ACM, 9–16.
[2] Shlomo Berkovsky, Tsvi Kuflik, and Francesco Ricci. 2012. The Impact of Data Obfuscation on the Accuracy of Collaborative Filtering. Expert Systems with Applications 39, 5 (2012), 5033–5042.
[3] László A. Jeni, Jeffrey F. Cohn, and Fernando De La Torre. 2013. Facing Imbalanced Data—Recommendations for the Use of Performance Metrics. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, 245–251.
[4] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
[5] Dongsheng Li, Qin Lv, Li Shang, and Ning Gu. 2017. Efficient Privacy-Preserving Content Recommendation for Online Social Communities. Neurocomputing 219 (2017), 440–454.
[6] Yongsheng Liu, Hong Qu, Wenyu Chen, and S. M. Hasan Mahmud. 2019. An Efficient Deep Learning Model to Infer User Demographic Information From Ratings. IEEE Access 7 (2019), 53125–53135.
[7] Roger McNamee and Sandy Parakilas. 2018. The Facebook breach makes it clear: data must be regulated. The Guardian. https://www.theguardian.com/commentisfree/2018/mar/19/facebook-data-cambridge-analytica-privacy-breach. Online; accessed 05-July-2019.
[8] Rupa Parameswaran and Douglas M. Blough. 2007. Privacy Preserving Collaborative Filtering Using Data Obfuscation. In 2007 IEEE International Conference on Granular Computing (GRC '07). IEEE, 380–380.
[9] Sravana Reddy and Kevin Knight. 2016. Obfuscating Gender in Social Media Writing. In Proceedings of the 2016 EMNLP Workshop on NLP and Computational Social Science. ACL, 17–26.
[10] Vicenç Torra. 2017. Data Privacy: Foundations, New Developments and the Big Data Challenge. Springer International Publishing, Cham, 191–238.
[11] Udi Weinsberg, Smriti Bhagat, Stratis Ioannidis, and Nina Taft. 2012. BlurMe: Inferring and Obfuscating User Gender Based on Ratings. In Proceedings of the 2012 ACM Conference on Recommender Systems (RecSys '12). ACM, 195–202.