An Empirical Analysis of Collaborative
Recommender Systems Robustness to Shilling
Attacks
Anu Shrestha, Francesca Spezzano and Maria Soledad Pera
Boise State University, Boise, ID, USA


                                      Abstract
                                      Recommender systems play an essential role in our digital society as they suggest products to purchase,
                                      restaurants to visit, and even resources to support education. Recommender systems based on collab-
                                      orative filtering are the most popular among the ones used in e-commerce platforms to improve user
                                      experience. Given the collaborative environment, these recommenders are more vulnerable to shilling
                                      attacks, i.e., malicious users creating fake profiles to provide fraudulent reviews, which are deliberately
                                      written to sound authentic and aim to manipulate the recommender system to promote or demote target
                                      products or simply to sabotage the system. Therefore, understanding the effects of shilling attacks and
                                      the robustness of recommender systems has gained massive attention. However, empirical analyses
                                      thus far have assessed the robustness of recommender systems via simulated attacks, and there is a lack
                                      of evidence on the impact of fraudulent reviews in a real-world setting. In this paper, we present
                                      the results of an extensive analysis conducted on multiple real-world datasets from different domains
                                      to quantify the effect of shilling attacks on recommender systems. We focus on the performance of vari-
                                      ous well-known collaborative filtering-based algorithms and their robustness for different types of users.
                                      Trends emerging from our analysis unveil that, in the presence of spammers, recommender systems are
                                      not uniformly robust for all types of benign users.

                                      Keywords
                                      Recommender system, shilling attack, robustness, fraudulent reviews, collaborative filtering




1. Introduction
The effect of shilling attacks on recommender systems, where malicious users create fake
profiles so that they can then manipulate algorithms by providing fake reviews or ratings, has
long been studied [1, 2, 3]. So far, recommender system researchers have: (1) Characterized
and modeled recommender system shilling attacks (where malicious users insert fake profiles
to manipulate recommendations), (2) Defined new metrics to quantify the impacts of these
attacks on commonly used recommender systems, and (3) Applied a detect + filtering approach
to mitigate the effects of spammers on recommendations. Nevertheless, we observe from the
literature that the analysis thus far has focused on assessing the robustness of recommender
systems via simulated attacks [4, 5]. Unfortunately, there is a lack of evidence on the

OHARS’21: Second Workshop on Online Misinformation- and Harm-Aware Recommender Systems, October 2, 2021,
Amsterdam, Netherlands
" anushrestha@u.boisestate.edu (A. Shrestha); francescaspezzano@boisestate.edu (F. Spezzano);
solepera@boisestate.edu (M. S. Pera)
                                    © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



impact of fake reviews or fake ratings in a real-world setting.
    In this paper, we present an analysis conducted to understand the influence of fraudulent
reviews on the recommendation process in real-world scenarios. We do this through a study of
known datasets with gold standards in different domains and several commonly-used recom-
mendation algorithms. Specifically, we utilize data from two widely-used e-commerce platforms,
Yelp! and Amazon. Among various recommendation algorithms, we consider collaborative
filtering-based approaches as they are the most efficient and popular recommenders in such
platforms. Thus, we focus our exploration on the robustness of these algorithms to shilling
attacks.
    The main contribution of this paper is two-fold. First, we analyze the performance of five
widely used collaborative filtering-based algorithms in the presence of spammers and compare
it to their performance when spammers are removed. By doing so, we seek to answer whether
shilling attacks affect the robustness of the considered recommender algorithms. Second, we
investigate whether there is a specific user group (non-mainstream users) that is affected more
than others (mainstream users) by spammers.
    Our results are validated by an empirical evaluation using classical measures for evaluating
predictive and top-N recommendation strategies. We show that RMSE scores decrease and
NDCG@5 scores increase when removing spammers in the majority of the considered algorithms
and datasets. This serves as an indication that the performances of considered collaborative-
filtering-based recommender algorithms are indeed affected by spam ratings/reviews. Further, a
deeper investigation to quantify the effects of spammers on recommendations received by certain
groups of users leads us to conclude that, for the Yelp! datasets, removing spammers improves
the predictive ability (RMSE) of all the considered recommender algorithms regardless of the
type of user, i.e., mainstream or non-mainstream. However, in the case of the Amazon datasets,
we observed a trend where removing spammers lessens the predictive ability for mainstream
users based on RMSE, whereas it improves performance for non-mainstream users according
to both RMSE and NDCG@5. Therefore, non-mainstream users, whose rating behavior does not
align with that of the majority of users, are the ones most affected by spam ratings in the Amazon
datasets. Overall, we observed that 25%-29% of benign users in the Amazon datasets would
not be equally satisfied by recommenders affected by shilling attacks. Thus, recommender
algorithms are not uniformly robust for all types of benign users in the presence of spammers'
ratings/reviews.
    The rest of this paper is organized as follows. In Section 2, we summarize related work; we
then outline the dataset, algorithms, and evaluation strategies used in our empirical analysis in
Section 3. In Section 4 we report on our results and, finally, conclusions are drawn in Section 5.


2. Related work
Collaborative filtering-based recommender systems are widely used to provide recommendations
to users in opinion-based systems, yet they are vulnerable to shilling attacks [6, 2]. These attacks
consist of fake user profiles injected into the system with the goal of providing spam ratings
or reviews to promote or demote specific products. While some shilling attacks promote the
recommendations of certain attacked items (referred to as the push attack), others might demote



the predictions that are made for attacked items (referred to as the nuke attack) [4, 7]. Previous
work has defined several attack strategies, including random, average, bandwagon, love/hate,
segmented, and probe attacks. These strategies differ in the way fake profiles choose filler items,
i.e., other rated items chosen beyond attacked items to camouflage the fraudulent behavior.
More sophisticated attacks have been recently proposed [6], for instance, the one by Fang et
al. [8] looks at how to choose filler items to recommend an attacker-chosen targeted item to as
many users as possible. In the field of machine learning, many efforts have been devoted over the
years to developing techniques for the automatic detection of such fraudulent profiles; the techniques
presented in [9] and [10] are among the most recent ones. In the field of recommender systems,
researchers have focused on studying the effects of such shilling attacks mainly on collaborative
filtering-based recommenders since the early 2000s [11, 4] and developed strategies to make
such algorithms more robust to shilling attacks [2]. We highlight, for example, outcomes of
the research conducted by Seminario and Wilson [12, 13] who explicitly look at power user
and power items attacks, i.e., attacks targeting influential users and items, respectively, within
collaborative filtering-based recommender systems.
   Most recently, the concept of differential privacy has been explored to make matrix factorization-
based collaborative filtering recommender algorithms more robust [14]. The vulnerability of
deep-learning-based recommender systems to shilling attacks has been studied in [15]. In
particular, Lin et al. introduce a framework that considers complex attacks aimed towards
specific user groups. From a different perspective, Deldjoo et al. [16] explore dataset characteristics
to explain an observed change in the performance of recommendation under shilling attacks.
   Our work adds to this body of knowledge by exploring the robustness of collaborative recom-
mender systems to shilling attacks using real-world data with spam-review ground truth, as
opposed to simulated attacks, and by investigating whether some users are more vulnerable than others.


3. Experimental Settings
As previously stated, our goal is to analyze the robustness to shilling attacks of commonly-used
memory-based and model-based collaborative filtering recommendation algorithms, using a
number of datasets with ground truth on spam reviews. In the rest of this section, we describe
the experimental protocol for our analysis.

3.1. Datasets
For analysis purposes, we rely on four datasets (described below) produced based upon data
from two well-known e-commerce platforms: Yelp! and Amazon.

Yelp! We consider Yelp! reviews from two domains: hotels (YH) and restaurants (YR) [17].
Yelp filters fake/suspicious reviews and puts them on a spam list. A study found the Yelp filter
to be highly accurate [18], and many researchers have used filtered spam reviews as ground
truth for spammer detection [19, 20]. Spammers, in our case, are users who wrote at least one
filtered review. We removed users who rated the same products multiple times and reviews
with a rating of zero.
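
This preprocessing amounts to two simple filters on the rating table plus the spammer labeling. The
minimal pandas sketch below is illustrative only: the column names (user, item, rating, filtered) are
placeholders and not the actual field names of the Yelp! data.

    import pandas as pd

    # Hypothetical file and column names; the actual Yelp! exports differ.
    ratings = pd.read_csv("yelp_hotel_reviews.csv")  # assumed columns: user, item, rating, filtered

    # Spammers: users with at least one review on Yelp!'s filtered (spam) list.
    spammers = set(ratings.loc[ratings["filtered"] == 1, "user"])

    # Drop reviews with a rating of zero.
    ratings = ratings[ratings["rating"] > 0]

    # Drop users who rated the same product multiple times.
    dup_users = ratings.loc[ratings.duplicated(subset=["user", "item"], keep=False), "user"].unique()
    ratings = ratings[~ratings["user"].isin(dup_users)]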




                         Dataset     Users     Items    Ratings    Spammers
                          YH         5,027       72      5,857       14.92%
                           YR        34,523     129     66,060       20.25%
                           AB       167,725    29,004   252,056      3.57%
                          AH        311,636    39,539   428,781      4.12%
Table 1
Details on the datasets considered for our analysis.




Figure 1: Rating distribution across the four datasets considered in our analysis.


Amazon We also consider Amazon reviews from two domains: beauty (AB) and health
(AH) [21]. In this case, we define ground truth based on helpfulness votes following the
approach suggested by [9] and based on the findings provided by Fayazi et al. [22]. Specifically, we
define a review as spam if its rating is 4 or 5 and its helpfulness ratio is ≤ 0.4, and we treat every
user who wrote at least one spam review as a spammer.
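
A minimal sketch of this labeling rule follows. The column names (user, rating, helpful_votes,
total_votes) and the handling of reviews with no helpfulness votes are our assumptions for
illustration, not part of the original definition.

    import pandas as pd

    reviews = pd.read_csv("amazon_beauty_reviews.csv")  # assumed columns: user, item, rating, helpful_votes, total_votes

    # Helpfulness ratio: fraction of votes that judged the review helpful
    # (reviews with no votes get a ratio of 0 here; this edge case is an assumption).
    reviews["help_ratio"] = reviews["helpful_votes"] / reviews["total_votes"].clip(lower=1)

    # A review is spam if it is highly positive (4 or 5 stars) yet judged unhelpful (ratio <= 0.4).
    reviews["is_spam"] = (reviews["rating"] >= 4) & (reviews["help_ratio"] <= 0.4)

    # A spammer is any user who wrote at least one spam review.
    spammers = set(reviews.loc[reviews["is_spam"], "user"])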
   We provide descriptive statistics for the four datasets in Table 1. It is important to note that
rating distribution is not similar across the datasets. As illustrated in Figure 1, rating trends
from AH are dissimilar to those of the other datasets, with a vast number of users rating only 1
item. Moreover, the rating distribution of benign users vs. spammers on attacked products (i.e.,
products receiving at least one spam review) is captured in Figure 2. From this figure, it emerges
that benign users and spammer counterparts exhibit similar rating behavior in YR, AB, and AH,
whereas in the case of YH, spammers noticeably assign ratings of ‘1’ more often than benign
counterparts; the opposite is true for ratings of ‘4’.

3.2. Algorithms
We focus our analysis on well-known and widely-used collaborative filtering-based recommen-
dation algorithms implemented using Lenskit for python [23], except for Probabilistic Matrix
Factorization, for which we relied on the implementation provided by Mnih et al. [24].




Figure 2: Rating distribution for attacked products by spammers and benign users: (a) Yelp!-Hotel,
(b) Yelp!-Restaurant, (c) Amazon-Beauty, (d) Amazon-Health.


Item-item [25] is the popular item-based collaborative filtering algorithm. It utilizes an item-
item matrix to determine the similarity between the target item and other items (neighbors).
We used this algorithm with 20 neighbors and cosine similarity as the similarity measure.

Probabilistic Matrix Factorization (PMF) [24] is a commonly-used latent factor-based
recommendation algorithm. Specifically, probabilistic matrix factorization decomposes the
sparse user-item matrix into low-dimensional matrices with latent factors to generate recom-
mendations. We used this algorithm with 40 latent factors and 150 iterations. This algorithm is
known for its accuracy, scalability, and ability to handle sparse data.

Alternating Least Squares (ALS) [26] is a matrix factorization-based algorithm designed to
improve recommendation performance in large-scale collaborative filtering problems. This
algorithm gained recognition following its success in the Netflix Prize competition [27, 28]. In our
case, we consider 40 latent factors, a damping factor of 5, and 150 iterations for our experiment.

Bayesian Personalized Ranking (BPR) [29] is a rank-based matrix factorization algorithm; we
use it with 40 latent factors, a damping factor of 5, and 150 iterations for our experiment. Note
that, as a top-N recommendation algorithm that treats rating information as implicit feed-
back [29, 30], BPR scores items but does not produce rating predictions. Thus, we exclude BPR
from the RMSE-based analysis discussed in Section 4.

FunkSVD [31] is the well-known gradient descent matrix factorization technique with 40
latent features and 150 training iterations per feature.
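
For reference, the configurations above can be expressed roughly as follows with LensKit for
Python [23]. The module paths and parameter names assume the LensKit 0.x API; PMF is omitted
because we rely on the separate implementation by Mnih et al. [24], and BPR likewise comes from
its own implementation and is not shown.

    # A minimal sketch of the algorithm configurations, assuming the LensKit 0.x Python API.
    from lenskit.algorithms.item_knn import ItemItem
    from lenskit.algorithms.als import BiasedMF
    from lenskit.algorithms.funksvd import FunkSVD

    algorithms = {
        "Item-Item": ItemItem(nnbrs=20),                          # 20 neighbors, cosine similarity
        "ALS": BiasedMF(features=40, iterations=150, damping=5),  # 40 factors, damping 5, 150 iterations
        "FunkSVD": FunkSVD(features=40, iterations=150),          # 150 training iterations per feature
    }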




3.3. Evaluation Framework
Following the classical evaluation framework for shilling attacks on recommender systems [4],
we measured performance on the original dataset (with spammers, i.e., under the shilling attack)
and on the dataset obtained by removing all spammers, using well-known metrics. In all cases, we
performed 5-fold cross-validation. We tested whether differences in the metric values with and
without spammers were statistically significant using a paired t-test.
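
The protocol can be sketched as follows. This is an illustrative outline only: it assumes the LensKit
0.x batch/crossfold utilities, a rating table with user/item/rating columns, and the spammer sets
built during preprocessing; the paired t-test shown pairs per-fold scores, one of several reasonable
pairing choices.

    # Illustrative sketch of the evaluation loop: 5-fold cross-validation with and
    # without spammers, per-fold RMSE, and a paired t-test on the two score lists.
    from lenskit import batch, crossfold as xf
    from lenskit.metrics.predict import rmse
    from lenskit.algorithms.als import BiasedMF
    from scipy import stats

    def fold_rmse(make_algo, ratings):
        scores = []
        for train, test in xf.partition_users(ratings, 5, xf.SampleFrac(0.2)):
            algo = make_algo()
            algo.fit(train)
            preds = batch.predict(algo, test)
            scores.append(rmse(preds["prediction"], preds["rating"]))
        return scores

    make_als = lambda: BiasedMF(features=40, iterations=150, damping=5)
    with_spam = fold_rmse(make_als, ratings)                                  # original data
    no_spam = fold_rmse(make_als, ratings[~ratings["user"].isin(spammers)])   # spammers removed
    print(stats.ttest_rel(with_spam, no_spam))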

Metrics. For assessment, we turn to Root Mean Square Error (RMSE) and Normalized Dis-
counted Cumulative Gain (NDCG), which are classical measures for evaluating predictive and
top-N recommendations. We also consider measures explicitly defined to quantify the impact of
spammer attacks on recommenders: Prediction Shift (PS), which captures the average absolute
change in predicted ratings for attacked items, and Hit Ratio (HR), which considers whether attacked
items are promoted into users' top-N recommendations (cf. Burke et al. [4] for formal definitions of
these metrics).
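
As a reading aid, the two attack-specific metrics can be sketched as follows. This reflects our
reading of the definitions in Burke et al. [4]; the data structures (prediction dictionaries and
per-user top-N lists) are chosen purely for illustration.

    import numpy as np

    def prediction_shift(preds_attacked, preds_clean):
        """Average absolute change in predicted ratings for attacked items.

        Both arguments map (user, item) pairs, restricted to attacked items,
        to predicted ratings with and without the spam profiles."""
        common = preds_attacked.keys() & preds_clean.keys()
        return np.mean([abs(preds_attacked[k] - preds_clean[k]) for k in common])

    def hit_ratio(topn_lists, attacked_items, n=5):
        """Fraction of users whose top-n list contains at least one attacked item."""
        hits = sum(1 for items in topn_lists.values() if attacked_items & set(items[:n]))
        return hits / len(topn_lists)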
  We first examined recommender performance by considering all users in the respective
datasets. We then segmented users into fairness categories as computed by the Fairness and
Goodness algorithm described in the next paragraph in order to allow for more in-depth
explorations. By segmenting users based on fairness scores, it is possible to identify mainstream
and non-mainstream users. The latter are those whose rating patterns do not align with the
majority, i.e., liking what most people dislike and vice versa [32].

Fairness and Goodness The Fairness and Goodness algorithm (F&G) [33] provides a measure
for capturing user rating behavior. While many measures exist for this task [9], we chose to use
F&G as Serra et al. [10] recently showed it to be the best measure to identify trustworthy users in
opinion-based systems. F&G computes a fairness score for each user and a goodness score for
each item. Specifically, the fairness 𝑓 (𝑢) of a user 𝑢 is a measure of how fair or trustworthy the
user is in rating items. Intuitively, a ‘fair’ or ‘trustworthy’ rater should give an item the rating
that it deserves, while an ‘unfair’ one would deviate from that value. Among benign users,
the latter may correspond to an uninformative or non-mainstream user. The goodness 𝑔(𝑖)
of an item 𝑖 specifies how much users in the system like the item and what its true quality is.
Fairness and goodness are mutually recursively computed as:

\[
  f(u) = 1 - \frac{1}{|out(u)|} \sum_{i \in out(u)} \frac{|W(u,i) - g(i)|}{R}  \tag{1}
\]

\[
  g(i) = \frac{1}{|in(i)|} \sum_{u \in in(i)} f(u) \times W(u,i)  \tag{2}
\]

   where 𝑊 (𝑢, 𝑖) is the rating given by the user 𝑢 to the item 𝑖, 𝑜𝑢𝑡(𝑢) is the set of ratings given
by user 𝑢, 𝑖𝑛(𝑖) is the set of ratings received by item 𝑖, and 𝑅 = 4 in this case, which corresponds
to the maximum error in a five-star rating system. Thus, the goodness of an item is given by
the average of its ratings, where each rating is weighted by the fairness of the rater, while the
fairness of a user considers how far the ratings given by the user are from the goodness of the



items. The higher the fairness, the more trustworthy the user is. Fairness scores of the user lie
in the [0, 1] interval, and goodness scores lie in the [1, 5] interval.
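
Equations 1 and 2 are mutually recursive and can be solved by simple fixed-point iteration. The
sketch below is our own illustrative implementation (the initialization at 1.0 and the convergence
threshold are assumptions); the resulting fairness scores are what we bin into the 0.1-wide intervals
used in Tables 3-5 to separate non-mainstream (low fairness) from mainstream (high fairness)
benign users.

    # Illustrative fixed-point iteration for Fairness & Goodness (Eqs. 1-2),
    # assuming a pandas DataFrame `ratings` with columns user, item, rating (1-5 scale).
    def fairness_goodness(ratings, max_iter=100, tol=1e-6, R=4.0):
        fairness = {u: 1.0 for u in ratings["user"].unique()}
        goodness = {i: 1.0 for i in ratings["item"].unique()}
        by_user = ratings.groupby("user")
        by_item = ratings.groupby("item")
        for _ in range(max_iter):
            # Goodness (Eq. 2): fairness-weighted average rating of each item.
            new_g = {i: (grp["rating"] * grp["user"].map(fairness)).sum() / len(grp)
                     for i, grp in by_item}
            # Fairness (Eq. 1): 1 minus the mean normalized deviation from goodness.
            new_f = {u: 1.0 - ((grp["rating"] - grp["item"].map(new_g)).abs() / R).mean()
                     for u, grp in by_user}
            delta = max(max(abs(new_g[i] - goodness[i]) for i in new_g),
                        max(abs(new_f[u] - fairness[u]) for u in new_f))
            fairness, goodness = new_f, new_g
            if delta < tol:
                break
        return fairness, goodness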
 Dataset  Algorithm    RMSE (W / W/o Spammers)   NDCG@5 (W / W/o)    HR@5 (W / W/o)          PS (Attacked Items)
 YH       Item-Item    1.33  / 1.32              0.104  / 0.105      0.031    / 0.032        0.087
          PMF          1.125 / 1.124             0.57   / 0.57       0.0217   / 0.0216       0.119
          BPR          n/a                       0.023  / 0.030      0.022    / 0.022        0.159
          ALS          1.028 / 1.020             0.041  / 0.043      0.0218   / 0.0217       0.143
          FunkSVD      1.023 / 1.019             0.034  / 0.032      0.022    / 0.022        0.129
 YR       Item-Item    1.181 / 1.179             0.0073 / 0.0075     0.013    / 0.013        0.119
          PMF          1.040 / 1.037             0.56   / 0.56       0.014    / 0.014        0.122
          BPR          n/a                       0.088  / 0.087      0.013    / 0.013        0.160
          ALS          0.971 / 0.970             0.049  / 0.051      0.013    / 0.013        0.148
          FunkSVD      0.993 / 0.981             0.012  / 0.013      0.0136   / 0.0137       0.138
 AB       Item-Item    0.95  / 0.95              0.295  / 0.299      0.000140 / 0.000149     0.130
          PMF          0.912 / 0.905             0.552  / 0.553      0.000051 / 0.000051     0.121
          BPR          n/a                       0.801  / 0.828      0.00023  / 0.00024      0.124
          ALS          0.802 / 0.802             0.265  / 0.264      0.0003   / 0.0003       0.116
          FunkSVD      0.637 / 0.644             0.028  / 0.032      0.0064   / 0.0061       0.11
 AH       Item-Item    1.151 / 1.154             0.290  / 0.294      0.00013  / 0.00012      0.105
          PMF          1.053 / 1.051             0.518  / 0.519      0.000036 / 0.000036     0.101
          BPR          n/a                       0.794  / 0.827      0.00023  / 0.00024      0.104
          ALS          0.952 / 0.952             0.198  / 0.204      0.00033  / 0.00032      0.10
          FunkSVD      0.994 / 0.933             0.070  / 0.067      0.00298  / 0.00296      0.10

Table 2
Performance analysis using different metrics on datasets with and without spam. Statistically signifi-
cant differences are shaded in gray, 𝑝𝑣𝑎𝑙𝑢𝑒 ≤ 0.001.




4. Results and Discussion
In this section, we present our experimental evaluation of five recommendation algorithms on
four datasets of different domains. We discuss the effect of shilling attacks on recommendations
offered to users in real-world scenarios, as opposed to simulated attacks. Specifically, we aim to
answer the following research questions:

RQ1 Do spammer’s ratings impact recommendations?
RQ2 Who is really affected by spammers?

By investigating recommender algorithm performance in the presence of spammers as well as
when spammers are removed, the first question enables us to gauge the shilling attacks’ effect
on the robustness of the considered recommendation algorithms. For the second question, we
used the fairness metric to determine mainstream and non-mainstream users and quantify the
effect of spammers on recommendations received by non-mainstream users.

4.1. Do spammers' ratings impact recommendations?
To answer RQ1, we consider the performance yielded by the recommender algorithms on four
different datasets, as reported in Table 2. The reported scores show that removing
spammers indeed leads to lower RMSE scores, i.e., better predictions. Previous works have
shown PS values ranging from 0.5 to 1.5 when shilling attacks are simulated [34]. However,



we observe very low values in real-world scenarios: in the cases we considered, PS ranges from
0.087 to 0.160, which we argue might not be enough to promote or demote products attacked
by the spammers. When looking at algorithm performance from a top-N recommendation
standpoint, from reported NDCG@5 we see that, often, NDCG@5 scores tend to increase when
removing spammers. This means that users’ preferred items are more likely to appear within
the top-5 recommendations when spammers are excluded. Unfortunately, improvement is not
always meaningful, i.e., from Table 2 we see that improvements are not always significant,
especially on YH. We anticipated lower HR@5 scores when excluding spammers, as we assumed
fewer attacked items would be promoted among the top-5 recommendations. Instead, we see
similar trends among HR@5 results as those observed for NDCG@5. In other words, for YH
and YR, performance is comparable regardless of the presence of spammers (i.e., differences in
performance are not significant); for AB and AH we see significant differences in performance.
   Overall, we can say that, in theory, the performance of collaborative filtering-based recom-
mender algorithms is affected by spammers’ ratings/reviews. This is particularly noticeable for
predictive recommenders (i.e., all algorithms yielded significant differences across the datasets).
In practice, however, performance improvements are for the most part barely perceptible. This
leads us to question whether algorithm robustness is adequately reflected by average metrics like
RMSE or NDCG. In the end, looking at recommender performance as a whole may not clearly quantify
how much spammers are able to deceive recommenders and, more importantly, whether there
are specific user groups that are affected more than others. With this in mind, we conduct a
more thorough analysis with the aim of understanding if the aforementioned differences in
performance are more pronounced among certain types of users (i.e., non-mainstream ones).

4.2. Who is really affected by spammers?
To better understand which users are really affected by spammers, we analyzed users based on
their fairness: the ability of a user to rate a product according to what it deserves. It is worth
noting, however, that the rating a product deserves often aligns with what the majority of
benign users (mainstream users) think about that product, as mainstream users often outnumber
non-mainstream and spam users. We investigate trends according to RMSE, NDCG@5, and hit
ratio. As noted in the prior subsection, prediction shift values were small, so we excluded this
metric from our analysis.
   Figure 3 illustrates how the RMSE varies according to the fairness of benign users; for ease of
readability, we highlight statistically significant differences in performance when spammers
are excluded in Table 3. We start by observing that, regardless of the algorithm for both Yelp!
datasets and with just one exception (YR, ALS, and FunkSVD, (0.4 − 0.5]), removing spammers
reduces the RMSE for all users. For the Amazon datasets, instead, when the user fairness
is greater than 0.4, removing spammers increases the RMSE for all users for each algorithm.
We posit these results could be due to the rating distributions of spammers vs. benign users
across these two platforms. As previously shown in Figure 2, spammer and benign users are
more similar in Amazon than Yelp!, with the majority of ratings being 4 and 5. Therefore,
removing spam could cause the recommender to lose information from mainstream users. On
the other hand, when fairness is less than or equal to 0.4 among Amazon users, in most cases
where the difference is statistically significant, i.e., 14 out of 19 cases, removing spammers



Figure 3: RMSE differences across fairness ranges for each dataset (rows: YH, YR, AB, AH) and
algorithm (columns: Item-Item, PMF, ALS, FunkSVD).

Dataset   Algorithm        [0-0.1]       (0.1-0.2]     (0.2-0.3]     (0.3-0.4]     (0.4-0.5]     (0.5-0.6]     (0.6-0.7]     (0.7-0.8]     (0.8-0.9]      (0.9-1]
          #Benign Users    1             6             10            347           443           368           1281          550           882            389
          Item-Item                                                                                                                        𝑊 > 𝑊/𝑜 * *
          PMF                                                                                                  𝑊 > 𝑊/𝑜 * *                  𝑊 > 𝑊/𝑜 * *
YH
          ALS                                                                      𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *    𝑊 > 𝑊/𝑜 * *
          FunkSVD                                                                                𝑊 > 𝑊/𝑜*                    𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *    𝑊 > 𝑊/𝑜 * *
          #Benign Users    0             0             2             269           6502          4898          4580          7450          1889           2071
          Item-Item                                                                                                                        𝑊 > 𝑊/𝑜 * *
          PMF                                                                      𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *    𝑊 > 𝑊/𝑜 * *
YR
          ALS                                                                      𝑊/𝑜 > 𝑊 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *    𝑊 > 𝑊/𝑜 * *
          FunkSVD                                                                  𝑊/𝑜 > 𝑊 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *    𝑊 > 𝑊/𝑜 * *
          #Benign Users    1716          5352          13914         27947         25755         18632         16354         15353         11860          24861
          Item-Item                      𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                 𝑊/𝑜 > 𝑊 * *                               𝑊/𝑜 > 𝑊 * *
          PMF                            𝑊 > 𝑊/𝑜*                    𝑊 > 𝑊/𝑜*                                                              𝑊/𝑜 > 𝑊 * *    𝑊/𝑜 > 𝑊 * *
AB
          ALS              𝑊 > 𝑊/𝑜*      𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *                               𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *    𝑊/𝑜 > 𝑊 * *
          FunkSVD                                                                                                                          𝑊 > 𝑊/𝑜 * *
          #Benign Users    2574          6718          23273         46289         56630         41021         32512         30435         22363          36978
          Item-Item        𝑊 > 𝑊/𝑜 * *   𝑊/𝑜 > 𝑊 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *
          PMF                            𝑊/𝑜 > 𝑊 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *    𝑊/𝑜 > 𝑊 * *
AH
          ALS              𝑊/𝑜 > 𝑊 *     𝑊/𝑜 > 𝑊 *     𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *
          FunkSVD                                                                                𝑊 > 𝑊/𝑜*


Table 3
Statistically significant RMSE differences between recommendations without spammers (W/o) and
with spammers (W) across different user fairness ranges (* means 𝑝𝑣𝑎𝑙𝑢𝑒 < 0.03 and ** means
𝑝𝑣𝑎𝑙𝑢𝑒 ≤ 0.01). Cases where removing spammers reduces the RMSE are shaded.


enables algorithms to avoid noise signals and thus perform better for these users (lower RMSE).
Note that there are more cases in AH than AB (4 out of 11 vs. 1 out of 8) where removing the
spammers is not beneficial for non-mainstream users. This could be due to the fact that, as
shown in Figure 1, AH data is sparser than the other datasets, making the process of generating
recommendations more difficult for most of the users in such a setting, independently of the
presence of spam.
  When we look at trends for NDCG@5, Figure 4 and Table 4 show that the quality of the
generated recommendations seldom improves on the Yelp! datasets for users having fairness



Figure 4: NDCG differences across fairness ranges for each dataset (rows: YH, YR, AB, AH) and
algorithm (columns: Item-Item, PMF, BPR, ALS, FunkSVD).

Dataset    Algorithm   [0-0.1]   (0.1-0.2]     (0.2-0.3]     (0.3-0.4]     (0.4-0.5]     (0.5-0.6]     (0.6-0.7]     (0.7-0.8]     (0.8-0.9]     (0.9-1]
           Item-Item
           PMF                                                                                                                     𝑊/𝑜 > 𝑊 * *
YH
           BPR                                                             𝑊 > 𝑊/𝑜 * *
           ALS
           FunkSVD                                                         𝑊 > 𝑊/𝑜 * *
           Item-Item
           PMF                                                             𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *
YR
           BPR                                                                                         𝑊 > 𝑊/𝑜 * *
           ALS                                                                           𝑊 > 𝑊/𝑜 * *
           FunkSVD                                                                       𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *
           Item-Item             𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊/𝑜 > 𝑊 * *                 𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *
           PMF                   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *                 𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *                 𝑊/𝑜 > 𝑊 * *
AB
           BPR                                               𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *                 𝑊/𝑜 > 𝑊 * *
           ALS                                                             𝑊/𝑜 > 𝑊 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *
           FunkSVD
           Item-Item                           𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *
           PMF                                 𝑊/𝑜 > 𝑊 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                 𝑊/𝑜 > 𝑊 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *
AH
           BPR                   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *
           ALS                   𝑊/𝑜 > 𝑊 * *                 𝑊/𝑜 > 𝑊 * *                               𝑊 > 𝑊/𝑜 * *
           FunkSVD


Table 4
Statistically significant NDCG@5 differences between recommendations without spammers (W/o) and
with spammers (W) across different user fairness ranges (* means 𝑝𝑣𝑎𝑙𝑢𝑒 < 0.03 and ** means
𝑝𝑣𝑎𝑙𝑢𝑒 ≤ 0.01). Cases where removing spammers increases the NDCG@5 are shaded.


greater than 0.5, whereas for the Amazon datasets, the value of NDCG@5 is higher when
spammers are removed in the majority of the cases (28 out of 47) and independently of the user
type.
   Overall, our analysis reveals that removing spammers helps in reducing the number of
attacked items that hit the top-5 recommendations for all the users in all the datasets (see HR@5
analysis in Table 5).1 Moreover, removing spammers in Yelp! is beneficial for all the users when
considering the predictive performance of algorithms (based on RMSE); for Amazon, top-N
algorithms are better (according to NDCG@5) among mainstream users, who see more tailored
     1
         For brevity, we exclude a figure akin to those complementing Tables 3 and 4.



Dataset   Algorithm   [0-0.1]       (0.1-0.2]     (0.2-0.3]     (0.3-0.4]     (0.4-0.5]     (0.5-0.6]     (0.6-0.7]     (0.7-0.8]     (0.8-0.9]     (0.9-1]
          Item-Item
          PMF                                                                 𝑊/𝑜 > 𝑊 * *                                             𝑊 > 𝑊/𝑜 * *
YH
          BPR                                                                 𝑊/𝑜 > 𝑊 * *                                             𝑊 > 𝑊/𝑜 * *
          ALS                                                                 𝑊/𝑜 > 𝑊 * *                                             𝑊 > 𝑊/𝑜 * *
          FunkSVD                                                             𝑊/𝑜 > 𝑊 * *                                             𝑊 > 𝑊/𝑜 * *
          Item-Item                                                                         𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜*
          PMF                                                                               𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *
YR
          BPR                                                                               𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜*
          ALS                                                                               𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *
          FunkSVD                                                                           𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜*
          Item-Item   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *
          PMF         𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *                 𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *   𝑊/𝑜 > 𝑊 * *
AB
          BPR         𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *
          ALS                       𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                               𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *
          FunkSVD     𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *
          Item-Item   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *
          PMF         𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                               𝑊 > 𝑊/𝑜 * *
AH
          BPR                       𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                 𝑊/𝑜 > 𝑊 * *                 𝑊 > 𝑊/𝑜 * *
          ALS                       𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊/𝑜 > 𝑊 * *                 𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *
          FunkSVD     𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *                 𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *   𝑊 > 𝑊/𝑜 * *


Table 5
Statistically significant HR@5 differences between recommendations without spammers (W/o) and
with spammers (W) across different user fairness ranges (* means 𝑝𝑣𝑎𝑙𝑢𝑒 < 0.03 and ** means
𝑝𝑣𝑎𝑙𝑢𝑒 ≤ 0.01). Cases where removing spammers reduces the HR@5 are shaded.


recommended items in their top-5 item list when spammers are removed from the system. Also,
we see performance improvement in terms of both RMSE and NDCG@5 scores for Amazon
non-mainstream users, i.e., those with low fairness, whose ratings are very
far from those of the majority of users. Non-mainstream users affected by spammers
represent 29% (resp. 25%) of benign users in AB (resp. AH). In a real-world scenario, these
percentages would translate into hundreds of thousands of users who would not be equally
satisfied by recommenders that are not robust to shilling attacks.


5. Conclusions
In this paper, we have taken a deeper look into how shilling attacks affect recommender systems
in a real-world scenario. For this, we conducted an in-depth exploration of the performance of
five well-known collaborative filtering algorithms on four different datasets.
    We saw similar trends among the performance of predictive and top-N recommenders: users
are exposed to better recommendations when spammers are excluded (RQ1). This highlights the
importance of further research on spammer detection and robust recommender systems. At the
same time, we question whether the small differences in performance (albeit statistically significant)
would be evident to recommender systems' users and whether the metrics considered for assessment,
which aggregate performance over all users, could obfuscate users who are more deeply affected
by spammers. This leads us to explore differences in performance between mainstream vs.
non-mainstream users (RQ2). We saw that Amazon non-mainstream users are the ones most
affected by spam ratings according to both RMSE and NDCG@5.
    Based on the findings emerging from the analysis presented in this paper, future
work will be devoted to looking at other types of recommender algorithms, beyond collaborative
filtering-based ones, to see if the trends we have observed remain. Moreover, we plan
also to test the effectiveness of adversarial training for recommender systems [3] under the
real-world attacks considered in this paper.




Acknowledgements
This work has been supported by the National Science Foundation under Awards no. 1943370
and 1820685.


References
 [1] P.-A. Chirita, W. Nejdl, C. Zamfir, Preventing shilling attacks in online recommender sys-
     tems, in: Proceedings of the 7th annual ACM international workshop on Web information
     and data management, 2005, pp. 67–74.
 [2] M. Si, Q. Li, Shilling attacks against collaborative recommender systems: a review, Artificial
     Intelligence Review 53 (2020) 291–319.
 [3] Y. Deldjoo, T. D. Noia, F. A. Merra, A survey on adversarial recommender systems: from
     attack/defense strategies to generative adversarial networks, ACM Computing Surveys
     (CSUR) 54 (2021) 1–38.
 [4] R. Burke, M. O’Mahony, N. Hurley, Robust collaborative recommendation, in: Recom-
     mender systems handbook, Springer, 2015, pp. 961–995.
 [5] C. E. Seminario, Accuracy and robustness impacts of power user attacks on collaborative
     recommender systems, in: RecSys, ACM, 2013, pp. 447–450.
 [6] A. P. Sundar, F. Li, X. Zou, T. Gao, E. D. Russomanno, Understanding shilling attacks and
     their detection traits: a comprehensive survey, IEEE Access 8 (2020) 171703–171715.
 [7] M. P. O’Mahony, N. J. Hurley, G. C. Silvestre, Recommender systems: Attack types and
     strategies, in: AAAI, 2005, pp. 334–339.
 [8] M. Fang, N. Z. Gong, J. Liu, Influence function based data poisoning attacks to top-n
     recommender systems, in: Proceedings of The Web Conference 2020, 2020, pp. 3019–3025.
 [9] S. Kumar, B. Hooi, D. Makhija, M. Kumar, C. Faloutsos, V. S. Subrahmanian, REV2:
     fraudulent user prediction in rating platforms, in: WSDM, ACM, 2018, pp. 333–341.
[10] E. Serra, A. Shrestha, F. Spezzano, A. C. Squicciarini, Deeptrust: An automatic framework
     to detect trustworthy users in opinion-based systems, in: CODASPY, ACM, 2020, pp.
     29–38.
[11] S. K. Lam, J. Riedl, Shilling recommender systems for fun and profit, in: WWW, ACM,
     2004, pp. 393–402.
[12] C. E. Seminario, D. C. Wilson, Attacking item-based recommender systems with power
     items, in: Proceedings of the 8th ACM Conference on Recommender systems, 2014, pp.
     57–64.
[13] C. E. Seminario, D. C. Wilson, Nuke’em till they go: Investigating power user attacks
     to disparage items in collaborative recommenders, in: Proceedings of the 9th ACM
     Conference on Recommender Systems, 2015, pp. 293–296.
[14] S. Wadhwa, S. Agrawal, H. Chaudhari, D. Sharma, K. Achan, Data poisoning attacks against
     differentially private recommender systems, in: Proceedings of the 43rd International
     ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp.
     1617–1620.
[15] C. Lin, S. Chen, H. Li, Y. Xiao, L. Li, Q. Yang, Attacking recommender systems with




     augmented user profiles, in: Proceedings of the 29th ACM International Conference on
     Information & Knowledge Management, 2020, pp. 855–864.
[16] Y. Deldjoo, T. Di Noia, E. Di Sciascio, F. A. Merra, How dataset characteristics affect the
     robustness of collaborative recommendation models, in: Proceedings of the 43rd Inter-
     national ACM SIGIR Conference on Research and Development in Information Retrieval,
     2020, pp. 951–960.
[17] A. Mukherjee, V. Venkataraman, B. Liu, N. S. Glance, What yelp fake review filter might
     be doing?, in: ICWSM, 2013, pp. 409–418.
[18] K. Weise, A lie detector test for online reviewers, in: Bloomberg BusinessWeek, 2011.
[19] S. Rayana, L. Akoglu, Collective opinion spam detection: Bridging review networks and
     metadata, in: SIGKDD, 2015, pp. 985–994.
[20] S. K. C., A. Mukherjee, On the temporal dynamics of opinion spamming: Case studies on
     yelp, in: WWW, 2016, pp. 369–379.
[21] J. McAuley, J. Leskovec, From amateurs to connoisseurs: modeling the evolution of user
     expertise through online reviews, in: WWW, 2013, pp. 897–908.
[22] A. Fayazi, K. Lee, J. Caverlee, A. Squicciarini, Uncovering crowdsourced manipulation of
     online reviews, in: SIGIR, 2015, pp. 233–242.
[23] M. D. Ekstrand, Lenskit for python: Next-generation software for recommender systems
     experiments, in: CIKM, 2020, pp. 2999–3006.
[24] A. Mnih, R. Salakhutdinov, Probabilistic matrix factorization, in: NIPS, 2008, pp. 1257–1264.
[25] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommenda-
     tion algorithms, in: WWW, 2001, pp. 285–295.
[26] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: IEEE
     ICDM, 2008, pp. 263–272.
[27] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems,
     Computer 42 (2009) 30–37.
[28] Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan, Large-scale parallel collaborative filtering for
     the netflix prize, in: AAIM, Springer, 2008, pp. 337–348.
[29] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, Bpr: Bayesian personalized
     ranking from implicit feedback, arXiv preprint arXiv:1205.2618 (2012).
[30] H. Wang, N. Wang, D.-Y. Yeung, Collaborative deep learning for recommender systems, in:
     Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery
     and data mining, 2015, pp. 1235–1244.
[31] A. Paterek, Improving regularized singular value decomposition for collaborative filtering,
     in: Proceedings of KDD cup and workshop, volume 2007, 2007, pp. 5–8.
[32] R. Z. Li, J. Urbano, A. Hanjalic, Leave no user behind: Towards improving the utility of
     recommender systems for non-mainstream users, in: WSDM, ACM, 2021, pp. 103–111.
[33] S. Kumar, F. Spezzano, V. S. Subrahmanian, C. Faloutsos, Edge weight prediction in
     weighted signed networks, in: IEEE ICDM, 2016, pp. 221–230.
[34] S. K. Lam, J. Riedl, Shilling recommender systems for fun and profit, in: WWW, 2004, pp.
     393–402.



