                         The Role of Fake Users in Sequential Recommender Systems⋆
Filippo Betello
Sapienza University of Rome, Rome, Italy


Abstract
Sequential Recommender Systems (SRSs) are widely used to model user behavior over time, yet their robustness remains an under-explored area of research. In this paper, we conduct an empirical study to assess how the presence of fake users, who engage in random interactions, follow popular or unpopular items, or focus on a single genre, impacts the performance of SRSs in real-world scenarios. We evaluate two SRS models across multiple datasets, using established metrics such as Normalized Discounted Cumulative Gain (NDCG) and Rank List Sensitivity (RLS) to measure performance. While traditional metrics like NDCG remain relatively stable, our findings reveal that the presence of fake users severely degrades RLS metrics, often reducing them to near-zero values. These results highlight the need for further investigation into the effects of fake users on training data and emphasize the importance of developing more resilient SRSs that can withstand different types of adversarial attacks.

Keywords
Recommender Systems, Evaluation of Recommender Systems, Model Stability, Input Data Perturbation



1. Introduction

Recommender Systems (RSs) have become an essential part of our daily lives, helping users navigate the vast online information landscape [1]. With the global expansion of e-commerce services, social media platforms and streaming services, these systems have become essential for personalising content delivery and increasing user engagement [2].

Over the last several years, Sequential Recommender Systems (SRSs) have gained significant popularity as an effective method for modeling user behavior over time [3]. By capitalizing on the temporal dependencies within users' interaction sequences, these systems can make more precise predictions about user preferences [4]. This approach allows for a more nuanced understanding of user behavior, leading to recommendations that are better tailored to individual needs and preferences. As a result, SRSs have become a critical component in various applications, ranging from e-commerce [5] to music recommendation [6], where understanding and anticipating user preferences is key to enhancing user experience and engagement.

In recent years, the prevalence of bots (fake users) on social media platforms has increased dramatically [7]. It is estimated that Amazon, for example, spends 2% of its net revenue each year fighting counterfeiting [8]. While several techniques have been identified to counteract this growing problem [9, 10], a detailed investigation in the area of sequential recommendation systems is still lacking. Li et al. [11] aim to fill this gap by investigating the impact of bot-generated data on sequential recommendation models. Specifically, they seek to determine an optimal bot-generation budget and analyze its impact on popular matrix factorization models. Indeed, controlling and maintaining a large number of bots is costly.

Therefore, it is possible to create a limited number of bots that can significantly influence the prominence of a particular item or category. By strategically deploying these bots, the visibility and perceived importance of the targeted item or category can be enhanced, making it stand out more compared to others. Imagine if, by using fake users, it were possible to raise the profile of a certain category or product or, conversely, to lower the profile of another. This scenario represents a form of unfair competition and is therefore crucial to study. Understanding how fake users behave in controlled environments allows us to assess their impact on real users. It is also important to investigate whether partially coordinated fake users can actively improve the performance or predictions of a particular category or item.

In this paper, we investigate the impact of fake users on sequential recommendation systems. Specifically, we investigate how the inclusion of a certain percentage of bots affects the performance of real users. These bots are programmed to interact with random items, popular items, unpopular items, and items within the same category.

Our experiments focus on the following research questions:

    • RQ1: How does the value of standard metrics such as NDCG change for real users depending on the type and increasing number of fake users?
    • RQ2: How do recommendation lists for real users differ from those generated without fake users?
    • RQ3: Are more or less popular items favoured by the presence of fake users with certain types of interactions?

We evaluate our hypothesis using two different models, SASRec [12] and GRU4Rec [13], and by employing four different datasets, namely MovieLens 1M, MovieLens 100k [14], Foursquare New York City and Foursquare Tokyo [15].

RobustRecSys: Design, Evaluation, and Deployment of Robust Recommender Systems Workshop @ RecSys 2024, 18 October 2024, Bari, Italy.
⋆ This work was partially supported by projects FAIR (PE0000013) and SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU. It was also supported by the ERC Advanced Grant 788893 AMDROMA, the EC H2020 RIA project "SoBigData++" (871042), and the PNRR MUR project IR0000013-SoBigData.it, as well as by the NEREO (Neural Reasoning over Open Data) project funded by the Italian Ministry of Education and Research (PRIN), Grant no. 2022AEFHAZ.
betello@diag.uniroma1.it (F. Betello)
ORCID: 0009-0006-0945-9688 (F. Betello)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



2. Related Work

2.1. Sequential Recommender Systems

Sequential recommendation systems (SRSs) use algorithms that analyze a user's past interactions with items to provide personalized recommendations over time. These systems have found widespread application in areas such as e-commerce [16, 5], social media [17, 18], and music streaming services [19, 20, 6]. Unlike traditional recommender systems, SRSs take into account the sequence and timing of user interactions, resulting in more precise predictions of user preferences and behaviors [4].

Various methods have been developed to implement SRSs. Early approaches used Markov Chain models [21, 22], which, despite their simplicity, struggled with capturing complex dependencies in long-term sequences. More recently, Recurrent Neural Networks (RNNs) have become prominent in this domain [23, 13, 24]. RNNs encode a user's historical preferences into a vector that is updated at each time step to predict the next item in the sequence. However, RNNs can encounter difficulties with long-term dependencies and with generating diverse recommendations.

The attention mechanism [25] has introduced another promising approach. Models like SASRec [12] and BERT4Rec [26] leverage this mechanism to dynamically weight different parts of the sequence, capturing key features to enhance prediction accuracy.

Additionally, Graph Neural Networks have recently gained traction in the recommendation system field, particularly within the sequential domain [27, 28]. These networks excel at modeling complex relationships and dependencies, further advancing the capabilities of SRSs [29, 30, 31].

2.2. Training Perturbations

Robustness is an important aspect of SRSs, as they are vulnerable to noisy and incomplete data. Previous work [32, 33] investigated the effects of removing items at the beginning, middle and end of a sequence of temporally ordered items, and found that removing items at the end of the sequence significantly affected all performances.

Yin et al. [34] design an attack that promotes an attacker-chosen targeted item in federated recommender systems without requiring knowledge about user-item rating data, user attributes, or the aggregation rule used by the server. While studies are being conducted in other areas of recommendation [35, 36] and several techniques have been identified to counteract this growing problem [9, 10], a detailed investigation in the area of sequential recommendation systems is still lacking.

Li et al. [11] aim to address this issue by examining how bot-generated data affects sequential recommendation models. Their research focuses on finding the optimal budget for bot generation and assessing its influence on widely used matrix factorization models. Indeed, controlling and maintaining a large number of bots is costly. Previous research has proposed attacks using a limited number of users and clustering models [37], but these have not been extensively studied in the context of sequential recommendations.

To the best of our knowledge, our research is completely novel and breaks new ground. It explores the role that fake users might play in influencing real users. This study aims to shed light on the potential impact that fake users could have on the behaviour, opinions and interactions of real users within sequential recommendation systems.
3. Methodology

3.1. Background

The main objective of sequential recommendation systems is to predict the user's next interaction in a given sequence. Suppose we have a set of n users, represented as 𝒰 ⊂ N+, and a corresponding set of m items, represented as ℐ ⊂ N+. Each user u ∈ 𝒰 is associated with a time-ordered sequence of interactions S_u = [s_1, . . . , s_{L_u}], where each s_i ∈ ℐ denotes the i-th item with which the user has interacted. The length of this sequence, L_u, is greater than 1 and varies from user to user.

A sequential recommendation system (SRS), denoted ℳ, processes the sequence up to the L-th item, denoted S_u^L = [s_1, . . . , s_L], to suggest the next item, s_{L+1}. The recommendation output, r_{L+1} = ℳ(S_u^L) ∈ R^m, is a score distribution over all possible items. This distribution is used to create a ranked list of items, predicting the most likely interactions for user u in the next step, L + 1.

3.2. Fake user design

Given that each item in the set ℐ has a popularity value determined by user interactions, we designed four types of fake user scenarios:

    • Random: Items are randomly sampled from the entire set ℐ. Formally, each item s_i in the sequence S_u is selected with probability 1/|ℐ|.
    • Popularity: Items are sampled according to a popularity-based probability distribution P_pop, where the probability of selecting item s_i is proportional to its popularity p_i.
    • Unpopularity: Similar to the popularity-based scenario, but with a distribution P_unpop that penalises popular items. Here, the probability of selecting item s_i is inversely proportional to its popularity, Pr(s_i) ∝ 1/p_i, favoring less popular items.
    • Genre: In this scenario, items are sampled exclusively from a specific genre. It is only applied to the ML datasets.

These fake user sequences contain unique items, to ensure there are no repetitions. While the first scenario involves users acting independently without any sense of cooperation, the middle two scenarios introduce a level of implicit cooperation: users in these scenarios tend to converge on viewing either highly popular or highly unpopular items, reflecting a collective behavior. The average length of the sequences is the same as that of real users. The proportion of synthetic users varies, comprising 1%, 5%, 10%, 15% and 20% of the original dataset. The fake users are only added to the training data, leaving the test data unaffected.
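For concreteness, the four sampling scenarios above can be sketched in a few lines of Python (an illustrative sketch only, not the code used in our experiments; names such as item_pop and genre_of are hypothetical, and item popularities are assumed to be pre-computed from the training interactions):

    import numpy as np

    rng = np.random.default_rng(42)

    def sample_fake_sequence(length, item_pop, scenario, genre_of=None, genre=None):
        """Sample one fake-user sequence of unique items under a given scenario."""
        items = np.arange(len(item_pop))
        if scenario == "random":
            probs = np.ones(len(items))                    # uniform over the whole catalogue
        elif scenario == "popular":
            probs = item_pop.astype(float)                 # Pr(s_i) proportional to p_i
        elif scenario == "unpopular":
            probs = 1.0 / (item_pop.astype(float) + 1e-9)  # Pr(s_i) proportional to 1/p_i
        elif scenario == "genre":
            probs = (genre_of == genre).astype(float)      # uniform within a single genre
        else:
            raise ValueError(scenario)
        probs = probs / probs.sum()
        # unique items, no repetitions, as described above
        return rng.choice(items, size=length, replace=False, p=probs)

    # toy example: 1,000 items, one fake sequence as long as an average real sequence
    item_pop = rng.integers(1, 500, size=1000)
    fake_seq = sample_fake_sequence(length=50, item_pop=item_pop, scenario="unpopular")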
3.3. Models

In our study, we use two different architectures to validate our results:

    • SASRec [12], which uses self-attention mechanisms to evaluate the importance of each interaction between the user and the item.
    • GRU4Rec [13], an RNN model that uses gated recurrent units (GRUs) [38] to improve prediction accuracy.

We chose these two models because they have demonstrated exceptional performance in numerous benchmarks and are widely cited in the academic literature. Moreover, since one model is based on attention mechanisms and the other on RNNs, their different network operations make it particularly interesting to evaluate their behaviour.
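To give an idea of what these architectures look like, the sketch below shows a stripped-down GRU-based next-item scorer in PyTorch (a minimal illustration in the spirit of GRU4Rec; the experiments themselves rely on the SASRec and GRU4Rec implementations provided by the EasyRec library, not on this code):

    import torch
    import torch.nn as nn

    class TinyGRURecommender(nn.Module):
        """Minimal GRU-based next-item scorer (illustrative only)."""

        def __init__(self, num_items, dim=64):
            super().__init__()
            self.item_emb = nn.Embedding(num_items, dim, padding_idx=0)
            self.gru = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, num_items)    # one score per catalogue item

        def forward(self, seq):                      # seq: (batch, L) item ids
            h, _ = self.gru(self.item_emb(seq))      # encode the interaction history
            return self.out(h[:, -1, :])             # score distribution for step L+1

    model = TinyGRURecommender(num_items=3417)
    scores = model(torch.randint(1, 3417, (2, 20)))           # two toy sequences of length 20
    top10 = scores.argsort(dim=-1, descending=True)[:, :10]   # ranked top-10 list per user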
Table 1
Dataset statistics after pre-processing; users and items not having at least 5 interactions are removed. Avg. and Med. refer to the Average and Median of Actions/User, respectively.

    Name      Users    Items     Interactions   Density (%)   Avg.   Med.
    FS-NYC    1,083     9,989        179,468        1.659      165    116
    FS-TKY    2,293    15,177        494,807        1.421      215    146
    ML-100k     943     1,349         99,287        7.805      105     64
    ML-1M     6,040     3,416        999,611        4.845      165     96
3.4. Datasets

We use four different datasets:

MovieLens [14]: Frequently utilized to evaluate recommender systems, this benchmark dataset is employed in our study using both the 100K and 1M versions.

Foursquare [15]: This dataset includes check-in data from New York City and Tokyo, collected over a span of roughly ten months.

The statistics for all the datasets are shown in Table 1. Our pre-processing technique adheres to recognised principles, such as treating ratings as implicit, using all interactions without regard to the rating value, and deleting users and items with fewer than 5 interactions [12, 26]. For testing, as in [26, 12], we keep the most recent interaction for each user, while for validation, we keep the second-to-last action. The remaining interactions are added to the training set, which is the only split affected by the fake-user perturbation.

We focus exclusively on genres in the ML dataset, as it is the only dataset that contains category information. Specifically, we select only those categories that represent more than 5% of the total items in the dataset.
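As a sketch, the leave-one-out protocol described above amounts to the following (illustrative code with hypothetical data structures; per-user interaction lists are assumed to be sorted chronologically and already filtered to at least 5 interactions):

    def leave_one_out_split(user_sequences):
        """Most recent item -> test, second-to-last -> validation, the rest -> training."""
        train, valid, test = {}, {}, {}
        for user, seq in user_sequences.items():
            train[user] = seq[:-2]   # only this split is mixed with fake-user sequences
            valid[user] = seq[-2]    # second-to-last interaction
            test[user] = seq[-1]     # most recent interaction
        return train, valid, test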
3.5. Evaluation

We only carry out the evaluation on the real users. To evaluate the performance of the models, we employ traditional evaluation metrics used for sequential recommendation: Precision, Recall, MAP and NDCG. Moreover, to investigate the stability of the recommendation models, we employ the Rank List Sensitivity (RLS) [33]: it compares two lists of rankings 𝒳 and 𝒴, one derived from the model trained under standard conditions and the other derived from the model trained with perturbed data.

Given these two rankings, and a similarity function sim between them, we can formalise the RLS measure as

    RLS = (1 / |𝒳|) Σ_{k=1}^{|𝒳|} sim(R_{X_k}, R_{Y_k})        (1)

where X_k and Y_k represent the k-th ranking inside 𝒳 and 𝒴, respectively.

RLS's similarity measure can be chosen from two possible options:

    • Jaccard Similarity (JAC) [39] is a normalized measure of the similarity of the contents of two sets. A model is stable if its Jaccard score is close to 1:

        JAC(X, Y) = |X ∩ Y| / |X ∪ Y|        (2)

    • Finite-Rank-Biased Overlap (FRBO) [32] measures the similarity of orderings between two rank lists. Higher values indicate that the items in the two lists are arranged similarly:

        FRBO(X, Y)@k = ((1 − p) / (1 − p^k)) Σ_{d=1}^{k} p^{d−1} · (|X[1:d] ∩ Y[1:d]| / d)

All metrics are computed "@k", meaning that we use just the first k recommended items in the output ranking, with k ∈ {10, 20}.
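As an illustration, the two similarity measures and their aggregation into RLS can be transcribed directly from the formulas above (a minimal sketch; rankings are assumed to be given as ordered lists of item ids, and the persistence parameter p = 0.9 is an assumed value, not one fixed by the paper):

    def jac_at_k(x, y, k):
        """Jaccard similarity of the two top-k sets (Eq. 2)."""
        a, b = set(x[:k]), set(y[:k])
        return len(a & b) / len(a | b)

    def frbo_at_k(x, y, k, p=0.9):
        """Finite Rank-Biased Overlap of two rankings truncated at depth k."""
        total = 0.0
        for d in range(1, k + 1):
            overlap = len(set(x[:d]) & set(y[:d])) / d   # agreement at depth d
            total += p ** (d - 1) * overlap
        return (1 - p) / (1 - p ** k) * total

    def rls(rankings_a, rankings_b, sim, k=10):
        """Average similarity between corresponding per-user rank lists (Eq. 1)."""
        pairs = list(zip(rankings_a, rankings_b))
        return sum(sim(x, y, k) for x, y in pairs) / len(pairs)

    # identical lists give a similarity of 1.0
    print(rls([[1, 2, 3]], [[1, 2, 3]], jac_at_k, k=3))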
3.6. Experimental Setup

All experiments were performed on a single NVIDIA RTX A6000 with 10752 CUDA cores and 48 GB of RAM. We train the models for 500 epochs, fixing the batch size to 128 and using the Adam optimizer [40] with a learning rate of 10^-3. To run our experiments, we use the EasyRec library [41].

4. Results

Our experiments aim to address the following research questions:

    • RQ1: How does the value of standard metrics such as NDCG change for real users depending on the type and increasing number of fake users?
    • RQ2: How do recommendation lists for real users differ from those generated without fake users?
    • RQ3: Are more or less popular items favoured by the presence of fake users with certain types of interactions?

[Figure 1 here: line plots of the metric value versus the percentage of fake users (1%-20%) for each scenario. Panels: (a) NDCG@20 ML-1M SASRec; (b) MAP@10 FS-TKY GRU4Rec; (c) NDCG@20 ML-100k GRU4Rec; (d) NDCG@20 FS-NYC SASRec.]
Figure 1: Plots of various metrics for all the datasets considered as the percentage of fake users increases. The baseline is shown as a horizontal solid line, while the other lines show the metrics as the percentage of fake users changes for the three scenarios considered.

4.1. RQ1: Impact of Fake Users on Standard Metrics for Real Users

In Figure 1 the results for all datasets considered are shown for both models using the standard metrics.

Regarding SASRec, shown in Figure 1d for the FS-NYC dataset, we observe that performance tends to improve slightly in the unpopular scenario for the NDCG@20 metric, while for the popular and random interactions there is a gradual but consistent decline in performance. Regarding genre interactions in the ML-1M dataset, shown in Figure 1a, all genres appear to positively impact the NDCG metric. A more detailed analysis using RLS metrics is presented in Section 4.2.

In the case of GRU4Rec (Figures 1b and 1c), there is a slow but steady decline in performance for the ML-100k and FS-TKY datasets, with the decline occurring in a predictable manner for both metrics considered as the percentage of fake users increases.

4.2. RQ2: Analysis of Recommendation Lists Generated for Real Users

In Figure 2 we present the RLS metrics for all datasets considered, comparing the performance of the two models. These metrics are derived from predictions made by the standard model (trained without fake users) and predictions made after training with fake users.

When analysing SASRec on the ML-100k dataset (Figure 2a), the model shows minimal performance degradation. Conversely, the FS-TKY dataset gives less favourable results, with significantly worse performance and a Jaccard index close to 0, indicating that the generated lists have almost no overlap with the original lists (Figure 2b).

Figures 2c and 2d show the performance on the ML-1M dataset for the other sampling methods and on the ML-100k dataset for genre sampling. On the ML-1M dataset, the performance is relatively good, although the Jaccard index remains low at around 0.35 (Figure 2c). For ML-100k and genre interactions, the degradation in performance is consistent across all genres, worsening as the number of fake users increases.

The evaluation metrics for Foursquare show a significant drop in performance compared to other datasets, highlighting the limitations of the dataset [42].

An additional observation is that as the number of fake users increases, the performance of the model generally deteriorates. This suggests that while adding more fake users tends to reduce the effectiveness of the generated lists, managing a higher number of fake users becomes increasingly difficult.

4.3. RQ3: Influence of Fake User Interactions on Popular and Unpopular Items

We investigated whether popular and unpopular items were favoured in recommendation lists by analysing the percentage of the top 20 items recommended to each user. Our results show that unpopular items were consistently under-represented in these lists. This suggests that more users, a wider range of items, or consideration of a larger number of top positions (e.g. the top 100 items) may be necessary to gain a better understanding. On the other hand, in the ML-100k dataset, the percentage of popular items in the recommendation lists produced without any fake users is 5.73%. The introduction of popular fake users barely affects this percentage (5.68%), while the inclusion of unpopular fake users slightly reduces it to 5.45%.

These results suggest significant opportunities for future research, such as focusing on specific categories of items to either improve or reduce recommendation performance.
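The percentage reported in this section can be computed with a few lines (an illustrative sketch; the set of "popular" items and the per-user top-20 lists are assumed to be available from the trained model):

    def popular_share(top_k_lists, popular_items):
        """Fraction of recommended slots occupied by popular items, over all real users."""
        popular_items = set(popular_items)
        hits = sum(sum(item in popular_items for item in lst) for lst in top_k_lists)
        slots = sum(len(lst) for lst in top_k_lists)
        return hits / slots

    # e.g. a value of 0.0573 corresponds to the 5.73% reported for ML-100k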
5. Conclusion

In this work we investigated the impact of fake users on real users. These fake users can have random interactions, interact with popular or unpopular items, and are only added to the training set at different percentages of the total dataset. The results showed that, although the standard metrics were not significantly affected (with random perturbations causing the most significant degradation in performance), the output lists generated under these perturbations were significantly different from the standard lists produced by the model trained without any perturbation. These differences, measured using rank list sensitivity metrics, in particular Jaccard and FRBO, showed that in the case of MovieLens about half of the list elements were shared, whereas in the case of Foursquare almost no elements were shared. Furthermore, the proportion of popular and unpopular items in recommendations for real users was not affected by the presence of fake users.

This study opens up future research directions in a number of ways. First, it would be valuable to compare the number of recommended items, categorised as popular, unpopular and genre-specific, produced by a model trained under standard conditions with those generated by a model trained on fake users. This comparison could reveal significant differences in recommendation patterns. Second, the creation of a set of fake users could make it possible to systematically elevate or downgrade certain categories over time. Third, studying datasets with shorter interaction sequences, such as those from Amazon [43], could provide new insights into user behaviour and recommendation effectiveness. Finally, research should focus on building models resilient to these types of perturbations: the solution could lie in different training strategies [44], robust loss functions [45, 46], or different optimisation objectives [47].
[Figure 2 here: line plots of RLS-FRBO@20 and RLS-JAC@20 versus the percentage of fake users for the Popular, Unpopular and Random scenarios (per-genre lines for the genre scenario). Panels: (a) RLS-FRBO ML-100k SASRec; (b) RLS-FRBO FS-TKY SASRec; (c) RLS-JAC ML-1M GRU4Rec; (d) RLS-FRBO ML-100k GRU4Rec.]
Figure 2: Plots of RLS metrics for all the datasets considered as the percentage of fake users increases. The metrics are shown as the percentage of fake users changes for the three scenarios considered.

References

[1] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 734-749.
[2] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning based recommender system: A survey and new perspectives, ACM Comput. Surv. 52 (2019). URL: https://doi.org/10.1145/3285029. doi:10.1145/3285029.
[3] M. Quadrana, P. Cremonesi, D. Jannach, Sequence-aware recommender systems, ACM Computing Surveys (CSUR) 51 (2018) 1-36.
[4] S. Wang, L. Hu, Y. Wang, L. Cao, Q. Z. Sheng, M. Orgun, Sequential recommender systems: challenges, progress and prospects, arXiv preprint arXiv:2001.04830 (2019).
[5] H. Hwangbo, Y. S. Kim, K. J. Cha, Recommendation system development for fashion retail e-commerce, Electronic Commerce Research and Applications 28 (2018) 94-101.
[6] D. Afchar, A. Melchiorre, M. Schedl, R. Hennequin, E. Epure, M. Moussallam, Explainability in music recommender systems, AI Magazine 43 (2022) 190-208.
[7] E. Ferrara, O. Varol, C. Davis, F. Menczer, A. Flammini, The rise of social bots, Communications of the ACM 59 (2016) 96-104.
[8] M. Daniels, Amazon says its stopped 700k counterfeiters from making accounts last year, 2024. URL: https://www.modernretail.co/technology/amazon-says-its-stopped-700k-counterfeiters-from-making-accounts-last
[9] M. Mendoza, M. Tesconi, S. Cresci, Bots in social and interaction networks: detection and impact estimation, ACM Transactions on Information Systems (TOIS) 39 (2020) 1-32.
[10] M. Mazza, S. Cresci, M. Avvenuti, W. Quattrociocchi, M. Tesconi, Rtbust: Exploiting temporal patterns for botnet detection on twitter, in: Proceedings of the 10th ACM Conference on Web Science, 2019, pp. 183-192.
[11] H. Li, S. Di, L. Chen, Revisiting injective attacks on recommender systems, Advances in Neural Information Processing Systems 35 (2022) 29989-30002.
[12] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, in: 2018 IEEE International Conference on Data Mining (ICDM), IEEE, 2018, pp. 197-206.
[13] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with recurrent neural networks, 2016. arXiv:1511.06939.
[14] F. M. Harper, J. A. Konstan, The movielens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.
[15] D. Yang, D. Zhang, V. W. Zheng, Z. Yu, Modeling user activity preference by leveraging user spatial temporal characteristics in lbsns, IEEE Transactions on Systems, Man, and Cybernetics: Systems 45 (2014) 129-142.
[16] J. B. Schafer, J. A. Konstan, J. Riedl, E-commerce recommendation applications, Data Mining and Knowledge Discovery 5 (2001) 115-153.
[17] I. Guy, N. Zwerdling, I. Ronen, D. Carmel, E. Uziel, Social media recommendation based on people and tags, in: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, Association for Computing Machinery, New York, NY, USA, 2010, p. 194-201. URL: https://doi.org/10.1145/1835449.1835484. doi:10.1145/1835449.1835484.
[18] F. Amato, V. Moscato, A. Picariello, G. Sperlí, Recommendation in social media networks, in: 2017 IEEE Third International Conference on Multimedia Big Data (BigMM), IEEE, 2017, pp. 213-216.
[19] M. Schedl, P. Knees, B. McFee, D. Bogdanov, M. Kaminskas, Music recommender systems, Recommender Systems Handbook (2015) 453-492.
[20] M. Schedl, H. Zamani, C.-W. Chen, Y. Deldjoo, M. Elahi, Current challenges and visions in music recommender systems research, International Journal of Multimedia Information Retrieval 7 (2018) 95-116.
[21] F. Fouss, A. Pirotte, M. Saerens, A novel way of computing similarities between nodes of a graph, with application to collaborative recommendation, in: The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), IEEE, 2005, pp. 550-556.
[22] F. Fouss, S. Faulkner, M. Kolp, A. Pirotte, M. Saerens, et al., Web recommendation system based on a Markov-chain model, in: ICEIS (4), 2005, pp. 56-63.
[23] T. Donkers, B. Loepp, J. Ziegler, Sequential user-based recurrent neural network recommendations, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys '17, Association for Computing Machinery, New York, NY, USA, 2017, p. 152-160. URL: https://doi.org/10.1145/3109859.3109877. doi:10.1145/3109859.3109877.
[24] M. Quadrana, A. Karatzoglou, B. Hidasi, P. Cremonesi, Personalizing session-based recommendations with hierarchical recurrent neural networks, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys '17, Association for Computing Machinery, New York, NY, USA, 2017, p. 130-137. URL: https://doi.org/10.1145/3109859.3109896. doi:10.1145/3109859.3109896.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[26] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1441-1450.
[27] J. Chang, C. Gao, Y. Zheng, Y. Hui, Y. Niu, Y. Song, D. Jin, Y. Li, Sequential recommendation with graph neural networks, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21, Association for Computing Machinery, New York, NY, USA, 2021, p. 378-387. URL: https://doi.org/10.1145/3404835.3462968. doi:10.1145/3404835.3462968.
[28] Z. Fan, Z. Liu, J. Zhang, Y. Xiong, L. Zheng, P. S. Yu, Continuous-time sequential recommendation with temporal graph collaborative transformer, in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM '21, Association for Computing Machinery, New York, NY, USA, 2021, p. 433-442. URL: https://doi.org/10.1145/3459637.3482242. doi:10.1145/3459637.3482242.
[29] S. Wu, F. Sun, W. Zhang, X. Xie, B. Cui, Graph neural networks in recommender systems: a survey, ACM Computing Surveys 55 (2022) 1-37.
[30] A. Purificato, G. Cassarà, P. Liò, F. Silvestri, Sheaf neural networks for graph-based recommender systems, arXiv preprint arXiv:2304.09097 (2023).
[31] A. Purificato, F. Silvestri, Eco-aware graph neural networks for sustainable recommendations, arXiv preprint arXiv:2410.09514 (2024).
[32] F. Betello, F. Siciliano, P. Mishra, F. Silvestri, Investigating the robustness of sequential recommender systems against training data perturbations, in: European Conference on Information Retrieval, Springer, 2024, pp. 205-220.
[33] S. Oh, B. Ustun, J. McAuley, S. Kumar, Rank list sensitivity of recommender systems to interaction perturbations, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM '22, Association for Computing Machinery, New York, NY, USA, 2022, p. 1584-1594. URL: https://doi.org/10.1145/3511808.3557425. doi:10.1145/3511808.3557425.
[34] M. Yin, Y. Xu, M. Fang, N. Z. Gong, Poisoning federated recommender systems with fake users, in: Proceedings of the ACM on Web Conference 2024, 2024, pp. 3555-3565.
[35] G. Trappolini, V. Maiorca, S. Severino, E. Rodolà, F. Silvestri, G. Tolomei, Sparse vicious attacks on graph neural networks, IEEE Transactions on Artificial Intelligence 5 (2024) 2293-2303. doi:10.1109/TAI.2023.3319306.
[36] Z. Chen, F. Silvestri, J. Wang, Y. Zhang, G. Tolomei, The dark side of explanations: Poisoning recommender systems with counterfactual examples, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, Association for Computing Machinery, New York, NY, USA, 2023, p. 2426-2430. URL: https://doi.org/10.1145/3539618.3592070. doi:10.1145/3539618.3592070.
[37] Y. Wang, Y. Liu, Q. Wang, C. Wang, Clusterpoison: Poisoning attacks on recommender systems with limited fake users, IEEE Communications Magazine (2024).
[38] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using rnn encoder-decoder for statistical machine translation, 2014. arXiv:1406.1078.
[39] P. Jaccard, The distribution of the flora in the alpine zone. 1, New Phytologist 11 (1912) 37-50.
[40] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.
[41] F. Betello, A. Purificato, F. Siciliano, G. Trappolini, A. Bacciu, N. Tonellotto, F. Silvestri, A reproducible analysis of sequential recommender systems, IEEE Access (2024).
[42] A. Klenitskiy, A. Volodkevich, A. Pembek, A. Vasilev, Does it look sequential? An analysis of datasets for evaluation of sequential recommendations, arXiv preprint arXiv:2408.12008 (2024).
[43] Y. Hou, J. Li, Z. He, A. Yan, X. Chen, J. McAuley, Bridging language and items for retrieval and recommendation, arXiv preprint arXiv:2403.03952 (2024).
[44] A. Petrov, C. Macdonald, Effective and efficient training for sequential recommendation using recency sampling, in: Proceedings of the 16th ACM Conference on Recommender Systems, 2022, pp. 81-91.
[45] M. S. Bucarelli, L. Cassano, F. Siciliano, A. Mantrach, F. Silvestri, Leveraging inter-rater agreement for classification in the presence of noisy labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3439-3448.
[46] F. A. Wani, M. S. Bucarelli, F. Silvestri, Learning with noisy labels through learnable weighting and centroid similarity, in: 2024 International Joint Conference on Neural Networks (IJCNN), IEEE, 2024, pp. 1-9.
[47] A. Bacciu, F. Siciliano, N. Tonellotto, F. Silvestri, Integrating item relevance in training loss for sequential recommender systems, in: Proceedings of the 17th ACM Conference on Recommender Systems, 2023, pp. 1114-1119.