=Paper= {{Paper |id=Vol-3924/short7 |storemode=property |title=Robust Solutions for Ranking Variability in Recommender Systems |pdfUrl=https://ceur-ws.org/Vol-3924/short7.pdf |volume=Vol-3924 |authors=Bonifacio Marco Francomano,Federico Siciliano,Fabrizio Silvestri |dblpUrl=https://dblp.org/rec/conf/robustrecsys/FrancomanoSS24 }} ==Robust Solutions for Ranking Variability in Recommender Systems== https://ceur-ws.org/Vol-3924/short7.pdf
                         Robust Solutions for Ranking Variability in Recommender
                         Systems⋆
                         Bonifacio Marco Francomano1 , Federico Siciliano1 and Fabrizio Silvestri1
                         1
                             Sapienza University of Rome, Rome, Italy


                                           Abstract
In the field of recommender systems, an important issue within the current state-of-the-art is the inconsistency in item rankings produced by models initialized with different weight seeds. Although these models achieve convergence and obtain similar average performance metrics, their item rankings differ significantly. This phenomenon is quantitatively demonstrated using metrics such as Rank List Sensitivity (RLS) and Normalized Discounted Cumulative Gain (NDCG) across different model pairs. In this paper, we reaffirm the existence of this problem and provide new insights by analysing models with common item embeddings but different network initialization, and models with different item embeddings but common network initialization, to identify which network components most influence ranking variability. To address the general issue, we propose an ensemble approach that averages the outputs of multiple models. Our ensemble maintains the NDCG of the original model while significantly improving ranking stability: the RLS FRBO@10 value shows an approximate increase of 30.82%.

                                           Keywords
                                           Recommender Systems, Evaluation of Recommender Systems, Model Stability



1. Introduction

In recent years, neural sequential recommender systems have gained importance due to their ability to model user behavior over time, providing more accurate and personalized recommendations [1, 2]. Unlike traditional recommender systems that consider user preferences in a static context, neural sequential recommender systems capture the temporal dynamics of user interactions [3]. This capability is central in domains such as e-commerce [4], streaming services [5], and social media [6]. By analyzing the sequence of items a user interacts with, these systems can predict future preferences [7].

   Despite the advancements in neural sequential recommender systems, a significant issue persists: the variability in item rankings generated by models initialized with different weight seeds [8]. This rank variability is problematic as it affects the consistency and reliability of recommendations [9, 10]. Consider the following example of ranking variability between two different initializations of a model (Tables 1 and 2):

Table 1
Ranking Initialization 1

   Rank   Item ID
     1    Item A
     2    Item B
     3    Item C
     4    Item D
     5    Item E

Table 2
Ranking Initialization 2

   Rank   Item ID
     1    Item A
     2    Item C
     3    Item B
     4    Item E
     5    Item D

   In both initializations, when the models reach convergence, Item A is consistently ranked as the top item, which is expected as it is the correct positive item that the model should prioritize. However, there is significant variability in the ranking of the other items. For instance, Item B is ranked 2nd in Initialization 1 but drops to 3rd in Initialization 2. Similarly, Item C rises from 3rd in Initialization 1 to 2nd in Initialization 2. This variability can occur because the loss function used in training typically focuses on ensuring the correct positive item is ranked highest, but does not enforce a specific order for the remaining negative items.

   Even when models converge and predict the same top-ranked item for a given user sequence, the subsequent items in the ranking often differ. This inconsistency can affect tasks that require multiple item predictions simultaneously, generate relevance-ordered rankings, and impact the explainability of the model's recommendations. In particular, we refer to related works that address the sensitivity and robustness of recommender systems [11]. Oh et al. [9] have shown that recommender systems are highly sensitive to perturbations in the training data, where even minor changes can significantly alter the recommendations for users. They introduce Rank List Sensitivity (RLS) as a measure to assess this instability and propose the CASPER method, which identifies minimal perturbations that induce significant instability. Their experiments reveal that such perturbations, even if minimal, can drastically impact the recommendation lists, particularly for users who receive low-quality recommendations.

RobustRecSys: Design, Evaluation, and Deployment of Robust Recommender Systems Workshop @ RecSys 2024, 18 October, 2024, Bari, Italy.
⋆ This work was partially supported by projects FAIR (PE0000013) and SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU. Supported also by the ERC Advanced Grant 788893 AMDROMA, EC H2020 RIA project "SoBigData++" (871042), and PNRR MUR project IR0000013-SoBigData.it. This work has also been supported by the NEREO (Neural Reasoning over Open Data) project funded by the Italian Ministry of Education and Research (PRIN), Grant no. 2022AEFHAZ.
francomano.1883955@studenti.uniroma1.it (B. M. Francomano); siciliano@diag.uniroma1.it (F. Siciliano); fsilvestri@diag.uniroma1.it (F. Silvestri)
ORCID: 0000-0003-1339-6983 (F. Siciliano); 0000-0001-7669-9055 (F. Silvestri)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

   Similarly, Betello et al. [12]
investigate the robustness of Sequential Recommender Systems (SRSs) in the face of training data perturbations. They identify limitations in existing robustness measures like Rank-Biased Overlap (RBO) and propose Finite Rank-Biased Overlap (FRBO), a more suitable metric for finite rankings. Their findings highlight that perturbations at the end of a sequence can severely degrade system performance, emphasizing the importance of the position of perturbations within the training data. To address this challenge, this paper investigates the root causes of ranking variability, focusing on the role of weight initialization. We investigate whether this variability is primarily due to the initialization of item embeddings or to the initialization of the whole network. Through a detailed analysis using metrics such as RLS and NDCG, we show how different weight initialization seeds lead to significant discrepancies in ranking results, highlighting the need for more robust approaches. As a solution, we propose the use of ensemble models, which combine the predictions of multiple models initialized with different seeds. By averaging the scores of these models, we can improve the stability of the rankings while maintaining or even improving the overall performance as measured by the NDCG.

   Our experiments show that ensemble methods effectively reduce ranking variability. Specifically, models with different initialization seeds obtain an average RLS-FRBO@10 of 0.542, while an ensemble of models achieves a score of 0.709. Furthermore, we show that the average NDCG@10 across models with different seeds is 0.129, while an ensemble approach improves it to 0.132. These results show that ensemble methods not only reduce variability but also maintain or even slightly enhance recommendation quality. Our contributions can be summarized as follows:

   • We identified the significant impact of weight initialization on the variability of item rankings in recommender systems.
   • We proposed the use of ensemble models to reduce this variability, demonstrating that shared embeddings can further improve ranking consistency.

2. Methodology

2.1. Sequential Recommendation Model

The experimental framework employs the SASRec model [13], a state-of-the-art sequential recommendation system that uses self-attention mechanisms. The model processes user interaction sequences in the following stages:

   • Embedding Layer: Items in the user interaction sequence are transformed into dense vector representations in a continuous vector space. Given an item i from the vocabulary of size N, the embedding layer maps it to a dense vector v_i ∈ R^d, where d is the embedding dimension:

        v_i = E x_i

     Here, E ∈ R^{d×N} is the embedding matrix, and x_i is a one-hot encoded vector representing item i.

   • Unidirectional Self-Attention Mechanism: The SASRec model uses a unidirectional self-attention mechanism, which focuses on identifying temporal dependencies in the sequence of user interactions, considering only the items that precede a given item in the sequence. The attention score α_ij for items i and j (with j < i) is computed as:

        α_ij = softmax( (q_i)^⊤ k_j / √d_k )

     where q_i = W_Q v_i, k_j = W_K v_j, W_Q and W_K are learned weight matrices, and d_k is the dimension of the key vectors. This mechanism ensures that the model only attends to past items, maintaining the sequential nature of the recommendation process.

   • Prediction and Ranking: The attention outputs are used to predict the next item in the sequence, generating a ranked list of recommendations based on the user's interaction history. The output of the self-attention mechanism yields a context-aware embedding z_i for each item:

        z_i = Σ_{j=1}^{i−1} α_ij v_j

     The next item î_{t+1} is predicted by applying a softmax function over the dot products between z_i and the embedding matrix E:

        î_{t+1} = argmax( softmax( z_i^⊤ E ) )

   For example, if the user interacted with items [i_1, i_2, ..., i_t], the model will generate predictions for the next likely item î_{t+1} based on the context-aware embeddings. Assume i_1 corresponds to "Book A", i_2 to "Book B", and so on. If the attention mechanism strongly associates "Book B" with "Book D" in the past, "Book D" might be ranked higher as the next recommendation î_{t+1}.

   For this study, the SASRec model configuration includes an embedding dimension of 50, a single self-attention head, and a dropout rate of 0.2, the latter being employed to reduce the risk of overfitting.

2.2. Dataset

The dataset employed for this study is the MovieLens 1M (ML-1M) dataset [14], which is a widely recognized benchmark for evaluating recommendation algorithms. The MovieLens 1M dataset contains 1 million ratings provided by approximately 6,000 users on 4,000 movies. Each user in the dataset has rated at least 20 movies, making it a dense dataset that is well-suited for assessing the performance of sequential recommendation models like SASRec.

2.3. Dataset Preparation

The experimental dataset is curated with careful consideration of the following preprocessing steps:

   • Rating Threshold: To ensure meaningful interaction data, only users who have rated at least 5 items, and items rated by at least 5 users, are retained.
   • Data Partitioning: A leave-one-out strategy is adopted for data splitting. In particular, for each user, the most recent interaction is used for testing, while the one before the last is reserved for validation. The remaining data constitutes the training set.
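The leave-one-out partitioning can be sketched in a few lines. This is an illustrative snippet under our own naming (the paper does not provide code); `interactions` is assumed to map each user to a chronologically ordered list of item IDs:

```python
def leave_one_out_split(interactions):
    """Leave-one-out split: for each user, the most recent interaction
    goes to the test set, the one before it to validation, and the
    rest to training (illustrative sketch; names are our own)."""
    train, val, test = {}, {}, {}
    for user, items in interactions.items():
        if len(items) < 3:              # too short to split; keep for training
            train[user] = list(items)
            continue
        train[user] = list(items[:-2])  # all but the last two interactions
        val[user] = items[-2]           # one-before-last interaction
        test[user] = items[-1]          # most recent interaction
    return train, val, test

# Example: one user with four chronologically ordered item IDs.
train, val, test = leave_one_out_split({"u1": [10, 20, 30, 40]})
print(train["u1"], val["u1"], test["u1"])   # [10, 20] 30 40
```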
   • Negative Sampling: Negative sampling is employed to balance the dataset. During training and validation, one negative sample per positive instance is generated, whereas, in the testing phase, all non-interacted items are considered as negative samples.

2.4. Implementation Details

The training pipeline is meticulously designed to optimize performance and efficiency:

   • Batch Processing: Training is conducted with a batch size of 128, ensuring the model can efficiently process substantial data per iteration.
   • Model Checkpointing: To capture the optimal model configuration, checkpoints are saved at intervals. The best model is determined based on the highest NDCG@10 score observed during validation.
   • Training Setup: The model undergoes training for up to 200 epochs, utilizing the Adam optimizer. The objective function is the Binary Cross-Entropy Loss, which is particularly suited for this task: it computes probabilities over two classes, the positive item and the negative ones, since the model has to predict whether an item is the next one in the sequence or not. Additionally, BCE Loss outputs probabilities that can be used directly to rank items by their likelihood of being the next interaction.

2.5. Metrics

The evaluation of model performance is based on several metrics, addressing both predictive accuracy and robustness.

   • Performance Metrics: The primary metric for assessing predictive performance is NDCG (Normalized Discounted Cumulative Gain), calculated at various cutoffs (5, 10, and 20). NDCG evaluates the ranking quality of the recommended items, giving higher importance to items ranked closer to the top of the list.
   • Robustness Metrics: To evaluate the robustness of the recommender system, we use the Rank List Sensitivity (RLS) metric:

        RLS = (1 / |X_test|) Σ_{X_k ∈ X_test} sim(R^{X_k}_M, R^{X_k}_M′)

     Here, X_test is the set of test items, R^{X_k}_M and R^{X_k}_M′ are the rank lists generated by the model M and a perturbed version M′ for item X_k. The function sim(A, B) measures the similarity between two rank lists A and B. The RLS metric gives an average similarity score over all test items. RLS measures how much a model's recommendations change in response to small perturbations in the training data.
     There are two versions of RLS: RLS-RBO (Rank-Biased Overlap) and RLS-JAC (Jaccard Similarity). RLS-RBO, based on the Rank-Biased Overlap, computes the similarity between two ranked lists, but it is tailored for infinite rankings, which can limit its applicability in finite settings.

        RBO(A, B) = (1 − p) Σ_{d=1}^{|I|} p^{d−1} |A[1:d] ∩ B[1:d]| / d

     The Rank-Biased Overlap (RBO) measures the similarity of orderings between two rank lists A and B. The parameter p (typically set to 0.9) controls the weighting, with higher weights given to the top ranks. |I| is the total number of items, and A[1:d] represents the top-d items in list A. The RBO score lies between 0 and 1, where higher values indicate more similarity between the rank lists. RLS-JAC, on the other hand, uses the Jaccard similarity coefficient, focusing on the overlap of items between two sets without considering their order:

        Jaccard(A, B) = |A ∩ B| / |A ∪ B|

     where A and B are the sets of items in the two ranked lists. Given the limitations of RLS-RBO, we utilize an enhanced version called Finite Rank-Biased Overlap (FRBO) [12], which is specifically designed for finite-length rankings:

        FRBO(X, Y)@k = ((1 − p) / (1 − p^k)) Σ_{d=1}^{k} p^{d−1} |X[1:d] ∩ Y[1:d]| / d

     Here, p is a parameter that controls the weight assigned to ranks, and X[1:d] and Y[1:d] represent the top-d items in the two ranked lists being compared. The RBO value is now normalized by its maximum possible value, ensuring that the metric reaches 1 when the two rankings are identical. FRBO thus addresses the shortcomings of RBO by correctly handling identical rankings, making it more suitable for practical evaluation scenarios.
     By comparing different versions of RLS, we aim to capture how robustly the models handle perturbations, ensuring a more reliable assessment of system stability. For brevity, we will refer to the Jaccard version of RLS as RLS_J@k, and the FRBO version of RLS as RLS_F@k in the subsequent sections and tables. The notation @k indicates a cutoff at the first k items of the lists.

3. Research Questions

We aim to answer the following questions:

   • RQ1: Does a variation of an (apparently) small factor influence the robustness of Neural Recommender Systems?
   • RQ2: Do ensembles of models improve the robustness of Neural RecSys?
   • RQ3: Can we improve performance through ensembling?

3.1. RQ1

Tables 3 and 4 reveal that models initialized with different seeds, which means entirely different weight initializations, obtain an average Jaccard@10 of 0.505 and an average FRBO@10 of 0.542, both of which are significantly below the ideal value of 1.

   When initializing only the item embeddings with the same seed, Tables 5 and 6 show a slight but not significant increase, reaching a value of 0.514 for the Jaccard@10, and 0.547 for the FRBO@10.
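As a concrete reference for the similarity measures used inside RLS, the following minimal sketch implements Jaccard@k and FRBO@k as defined above (our own illustrative code, not the authors'; p defaults to 0.9 as in the text):

```python
def jaccard_at_k(a, b, k):
    """Jaccard similarity between the top-k item sets of two rankings."""
    sa, sb = set(a[:k]), set(b[:k])
    return len(sa & sb) / len(sa | sb)

def frbo_at_k(x, y, k, p=0.9):
    """Finite Rank-Biased Overlap: RBO truncated at depth k and
    normalized so that identical rankings score exactly 1."""
    total = 0.0
    for d in range(1, k + 1):
        agreement = len(set(x[:d]) & set(y[:d])) / d  # |X[1:d] ∩ Y[1:d]| / d
        total += p ** (d - 1) * agreement
    return (1 - p) / (1 - p ** k) * total

# Identical rankings attain the ideal value of 1 (up to float rounding):
print(round(frbo_at_k([1, 2, 3], [1, 2, 3], k=3), 6))   # 1.0
print(jaccard_at_k([1, 2, 3, 4], [1, 2, 5, 6], k=3))    # 0.5
```

Note that the (1 − p)/(1 − p^k) factor is exactly the normalization that distinguishes FRBO from truncated RBO: it guarantees the metric reaches 1 for identical finite rankings.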
   Seed 1    Seed 2     RLS_J@5         RLS_J@10        RLS_J@20
     42        43        0.465           0.508           0.565
     42        44        0.464           0.507           0.563
     42        45        0.456           0.498           0.555
     43        44        0.464           0.507           0.563
     43        45        0.463           0.506           0.561
     44        45        0.461           0.504           0.560
      AVG ± SD       0.462 ± 0.001   0.505 ± 0.001   0.561 ± 0.001

Table 3
RLS_J computed between models initialized with different seeds, for different values of k (5, 10, and 20).

   Seed 1    Seed 2     RLS_F@5         RLS_F@10        RLS_F@20
     42        43        0.499           0.546           0.580
     42        44        0.497           0.544           0.579
     42        45        0.489           0.536           0.571
     43        44        0.497           0.544           0.578
     43        45        0.495           0.542           0.577
     44        45        0.493           0.541           0.575
      AVG ± SD       0.495 ± 0.001   0.542 ± 0.001   0.577 ± 0.001

Table 4
RLS_F computed between models initialized with different seeds, for different values of k (5, 10, and 20).

   Seeds 1   Seeds 2    RLS_J@5         RLS_J@10        RLS_J@20
    43_42     43_45       0.481           0.528           0.581
    44_42     44_43       0.469           0.517           0.571
    42_43     42_44       0.467           0.515           0.569
    42_44     42_45       0.467           0.516           0.569
    42_43     42_45       0.466           0.513           0.567
    43_44     43_45       0.465           0.513           0.567
    43_42     43_45       0.465           0.511           0.566
    44_43     44_45       0.462           0.510           0.565
    44_42     44_45       0.462           0.511           0.565
    45_43     45_42       0.462           0.511           0.565
    45_43     45_44       0.463           0.511           0.566
    45_42     45_44       0.464           0.512           0.566
      AVG ± SD       0.466 ± 0.001   0.514 ± 0.001   0.568 ± 0.001

Table 5
RLS_J computed between models initialized with the same embedding seeds, for different values of k (5, 10, and 20). The notation x_y represents a model where the embedding layer is initialized with seed x, while the rest of the model is initialized with seed y.

   Seeds 1   Seeds 2    RLS_J@5         RLS_J@10        RLS_J@20
    43_44     42_44       0.461           0.507           0.564
    44_42     43_42       0.467           0.515           0.569
    44_43     42_43       0.465           0.513           0.567
    42_43     45_43       0.485           0.533           0.586
    42_44     45_44       0.481           0.529           0.582
    43_45     44_45       0.494           0.542           0.594
    44_45     42_45       0.503           0.550           0.602
    43_45     42_45       0.513           0.561           0.612
    43_42     45_42       0.510           0.557           0.609
    44_42     45_42       0.516           0.564           0.614
    44_43     45_43       0.511           0.558           0.610
    43_44     45_44       0.518           0.565           0.616
      AVG ± SD       0.494 ± 0.006   0.541 ± 0.006   0.594 ± 0.005

Table 7
RLS_J computed between models initialized with the same rest-of-the-network seed, for different values of k (5, 10, and 20). The notation x_y represents a model where the embedding layer is initialized with seed x, while the rest of the model is initialized with seed y.

   Seeds 1   Seeds 2    RLS_F@5         RLS_F@10        RLS_F@20
    43_44     42_44       0.501           0.546           0.581
    44_42     43_42       0.505           0.551           0.585
    44_43     42_43       0.502           0.549           0.583
    42_43     45_43       0.522           0.568           0.602
    42_44     45_44       0.517           0.564           0.598
    43_45     44_45       0.529           0.575           0.609
    44_45     42_45       0.537           0.583           0.617
    43_45     42_45       0.546           0.593           0.626
    43_42     45_42       0.543           0.590           0.623
    44_42     45_42       0.549           0.595           0.628
    44_43     45_43       0.544           0.591           0.624
    43_44     45_44       0.551           0.597           0.630
      AVG ± SD       0.529 ± 0.005   0.575 ± 0.005   0.609 ± 0.005

Table 8
RLS_F computed between models initialized with the same rest-of-the-network seed, for different values of k (5, 10, and 20). The notation x_y represents a model where the embedding layer is initialized with seed x, while the rest of the model is initialized with seed y.

   These results suggest that embedding initialization has a limited impact on the final model performance and resulting rankings. We therefore aim to investigate the relationship between two final trained embedding spaces.

   Seeds 1   Seeds 2    RLS_F@5         RLS_F@10        RLS_F@20
   43_44     43_42         0.514           0.562           0.596       3.1.1. Investigation of Embedding Spaces
   44_42     44_43         0.502           0.551           0.586
   42_43     42_44         0.501           0.549           0.584       To explore the relationship between embeddings from dif-
   42_44     42_45         0.501           0.549           0.584       ferent models, we applied a linear transformation to map
   42_43     42_45         0.498           0.546           0.582
   43_44     43_45         0.498           0.547           0.582       the embeddings from one model to another, followed by a
   43_42     43_45         0.496           0.545           0.580       visualization using Principal Component Analysis (PCA).
   45_43     45_44         0.494           0.543           0.578       This analysis aims to provide insights into how well the
   44_42     44_45         0.495           0.543           0.579
   44_43     44_45         0.496           0.545           0.580
                                                                       embeddings from different models align after the transfor-
                                                                       mation.
      AVG ± SD         0.499 ± 0.002   0.547 ± 0.001   0.582 ± 0.001
                                                                          The experimental procedure involved the following steps:
Table 6
𝑅𝐿𝑆𝐹 computed between models initialized with the same em-
                                                                            1. Embedding matrices: 𝑋 and 𝑌 are the embedding
bedding seeds, for different values of 𝑘 (5, 10, and 20). The no-              matrices of the first and second models, respectively,
tation x_y represents a model where the embedding layer is                     each with a shape of 𝑁 × 𝑑.
initialized with seed x, while the rest of the model is initialized         2. Fitting a Linear Regression Model: We use a
with seed y.                                                                   linear regression model to estimate a transformation
                                                                               matrix 𝑊 that maps the embeddings from Model 1
                                                                               to those of Model 2:
   Instead, initializing the rest of the network with the same
                                                                                                       𝑌ˆ = 𝑋 · 𝑊
seed but using different embedding seeds, leads to a signifi-
cant improvement: in Tables 7 and 8 we observe an value                           where 𝑌ˆ is the transformed embeddings obtained
of 0.541 for the Jaccard@10 and 0.575 for the FRBO@10.                            from Model 1 that should match those of Model 2.
    3. Dimensionality reduction with PCA: to visualize the embeddings, we reduce their dimensionality using Principal Component Analysis (PCA). Both the transformed embeddings 𝑌ˆ and the original embeddings 𝑌 are projected into a 2D space by retaining the first two principal components:

                      𝑌ˆPCA = PCA(𝑌ˆ)        𝑌PCA = PCA(𝑌)

    4. Visualization: we use scatter plots to visualize the relationship between the transformed and the original embeddings, for each PCA component. A red dashed line representing the bisector (𝑦 = 𝑥) is included in both plots to visually assess the alignment of the components.

Figure 1: Plot of the PCA components of the original embeddings of Model 2 and the transformed embeddings.

   Fig. 1 reveals a strong alignment between the transformed embeddings and the original embeddings of Model 2: the points are closely distributed along the bisector. This suggests that the linear transformation is highly effective in mapping the embeddings of Model 1 into the embedding space of Model 2. This finding implies that the embedding layers converge to similar spaces, which differ from one another only by a linear transformation. Assuming the transformation matrix 𝑊 has full rank, it represents an endomorphism and is therefore invertible. This implies that the attention mechanism, which applies linear transformations as described in Section 2.1, can effectively learn to align the embeddings, regardless of the specific embedding space to which the layers converge.
   As a result, we argue that the position to which an embedding converges is almost independent of the initialization seed: even when the rest of the network is initialized with different seeds, the embeddings tend to converge to similar positions in the embedding space. Consequently, changing the seed of the rest of the network has a more significant impact, because the representations of the items in the embedding space remain quite similar regardless of the seed, while the rest of the network's components do not. This highlights that the embedding space is relatively stable across different seeds, whereas the rest of the network is more sensitive to the initial seed.

3.2. RQ2

The next step of the analysis is to combine the scores of different models, in order to check whether the RLS between combined models changes significantly.

  Ensemble 1    Ensemble 2    𝑅𝐿𝑆𝐹 @5    𝑅𝐿𝑆𝐹 @10    𝑅𝐿𝑆𝐹 @20
    42_43          43_45        0.663       0.705        0.733
    42_43          44_45        0.631       0.673        0.702
    42_43          43_44        0.646       0.688        0.717
    42_43          42_44        0.657       0.698        0.726
    42_43            44         0.653       0.695        0.724

     AVG ± SD       0.650 ± 0.005    0.692 ± 0.005    0.720 ± 0.005

Table 9
RLS FRBO improvement with different ensembles. The notation x_y represents an ensemble formed by averaging the scores of models initialized with seeds x and y. The last row shows the RLS computed between an ensemble and a single model.

   It is evident from Table 9 that using ensembles significantly improves the RLS compared to individual models, indicating better stability.

3.3. RQ3

To investigate the performance, we present Table 10, showing the NDCG scores of the individual models computed @5, @10 and @20.

    Seed        NDCG@5         NDCG@10        NDCG@20
     42          0.106           0.132          0.157
     43          0.105           0.129          0.156
     44          0.101           0.127          0.152
     45          0.106           0.130          0.156

  AVG ± SD   0.105 ± 0.002   0.130 ± 0.002   0.155 ± 0.002

Table 10
NDCG scores for different seeds.

   In contrast, Table 11 presents the NDCG scores for ensembles of different sizes, with values averaged across all possible model combinations and the final row reporting the scores of the single ensemble formed by all four trained models. The results demonstrate a slight but consistent improvement in performance with larger ensembles.

  #Models      NDCG@5          NDCG@10         NDCG@20
     1       0.105 ± 0.002   0.130 ± 0.002   0.155 ± 0.002
     2       0.108 ± 0.001   0.133 ± 0.001   0.160 ± 0.001
     3       0.110 ± 0.001   0.135 ± 0.001   0.163 ± 0.001
     4       0.110           0.137           0.164

Table 11
NDCG values for ensembles of different sizes. The values are averages computed over all possible model combinations for each ensemble size, except for the last row, which shows the NDCG scores of the single ensemble formed by all four trained models.

4. Considerations on Computational Cost

In the context of enhancing the robustness of neural recommender systems, the use of an ensemble of models provides significant benefits. However, it also introduces a computational cost that scales linearly with the number of models in the ensemble. Specifically, the cost of training an ensemble is approximately the cost of training a single model multiplied by the number of models:

               Cost_ensemble = 𝑁_models × Cost_single_model
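The ensemble used above simply averages the per-item scores of the individual models before ranking. The following is a minimal sketch of that averaging step together with a top-𝑘 Jaccard overlap check (the scores, function names, and helper are our illustrative assumptions, not the authors' code):

```python
# Illustrative sketch of a score-averaging ensemble (synthetic scores;
# in the paper's setting each score list would come from a model
# trained with a different initialization seed).

def ensemble_scores(score_lists):
    """Average per-item scores across models (one score list per model)."""
    n_models = len(score_lists)
    n_items = len(score_lists[0])
    return [sum(scores[i] for scores in score_lists) / n_models
            for i in range(n_items)]

def rank_items(scores):
    """Return item indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

def jaccard_at_k(rank_a, rank_b, k):
    """Top-k Jaccard overlap between two rankings (one ingredient of
    the RLS-style stability metrics reported in the tables above)."""
    top_a, top_b = set(rank_a[:k]), set(rank_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)

# Two hypothetical models scoring the same five items.
model_1 = [0.9, 0.2, 0.7, 0.1, 0.5]
model_2 = [0.6, 0.3, 0.9, 0.2, 0.4]

ensemble_ranking = rank_items(ensemble_scores([model_1, model_2]))
print(ensemble_ranking)  # → [2, 0, 4, 1, 3]
```

Note that the averaging itself is negligible at inference time; the dominant cost of serving such an ensemble is running each underlying model once per request, which is exactly the linear scaling discussed in this section.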
   The same applies to the inference stage. As shown in Tables 9 and 11, the performance of the system remains relatively stable regardless of the number of models used, while robustness improves notably when moving from a single model to an ensemble of two models. Given this, a reasonable compromise is to use an ensemble of two models. This choice results in a computational cost of approximately:

               Cost_training = 2 × Cost_single_model × Iterations

for backpropagation during training, and:

               Cost_inference = 2 × Cost_single_model

for inference during deployment. This allows for a significant improvement in robustness at roughly double the computational cost.

5. Conclusions

In conclusion, this work addresses the critical issue of ranking variability in recommender systems, which arises from different model initialization seeds. Our findings suggest that ensemble methods, particularly those incorporating shared embeddings, offer a promising solution to mitigate this variability. By reducing ranking fluctuations, these methods enhance the reliability and consistency of recommendations. However, while our approach significantly reduces variability, some residual variability remains, indicating the need for further research into additional techniques or refinements to achieve even greater consistency in recommendations.

References

 [1] T. F. Boka, Z. Niu, R. B. Neupane, A survey of sequential recommendation systems: Techniques, evaluation, and future directions, School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China (2024).
 [2] F. Betello, A. Purificato, F. Siciliano, G. Trappolini, A. Bacciu, N. Tonellotto, F. Silvestri, A reproducible analysis of sequential recommender systems, IEEE Access 13 (2025) 5762–5772. doi:10.1109/ACCESS.2024.3522049.
 [3] M. Quadrana, P. Cremonesi, D. Jannach, Sequence-aware recommender systems, ACM Computing Surveys (CSUR) 51 (2018) 66:1–66:36. doi:10.1145/3190616.
 [4] U. Singer, H. Roitman, Y. Eshel, A. Nus, I. Guy, O. Levi, I. Hasson, E. Kiperwasser, Sequential modeling with multiple attributes for watchlist recommendation in e-commerce, in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (WSDM '22), Association for Computing Machinery, New York, NY, USA, 2022, pp. 937–946. URL: https://doi.org/10.1145/3488560.3498453. doi:10.1145/3488560.3498453.
 [5] C. Hansen, C. Hansen, L. Maystre, R. Mehrotra, B. Brost, F. Tomasi, M. Lalmas, Contextual and sequential user embeddings for large-scale music recommendation, in: Proceedings of the 14th ACM Conference on Recommender Systems (RecSys '20), Association for Computing Machinery, New York, NY, USA, 2020, pp. 53–62. URL: https://doi.org/10.1145/3383313.3412248. doi:10.1145/3383313.3412248.
 [6] Q. Tan, J. Zhang, J. Yao, N. Liu, J. Zhou, H. Yang, X. Hu, Sparse-interest network for sequential recommendation, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM '21), Association for Computing Machinery, New York, NY, USA, 2021, pp. 598–606. URL: https://doi.org/10.1145/3437963.3441811. doi:10.1145/3437963.3441811.
 [7] A. Sbandi, F. Siciliano, F. Silvestri, Mitigating extreme cold start in graph-based recsys through re-ranking, in: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24), Association for Computing Machinery, New York, NY, USA, 2024, pp. 4844–4851. URL: https://doi.org/10.1145/3627673.3680069. doi:10.1145/3627673.3680069.
 [8] E. D'Amico, G. Gabbolini, C. Bernardis, P. Cremonesi, Analyzing and improving stability of matrix factorization for recommender systems, Journal of Intelligent Information Systems (2022). URL: https://doi.org/10.1007/s10844-021-00658-9. doi:10.1007/s10844-021-00658-9.
 [9] S. Oh, B. Ustun, J. McAuley, S. Kumar, Rank list sensitivity of recommender systems to interaction perturbations, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM '22), ACM, Atlanta, GA, USA, 2022, pp. 1584–1594. URL: https://doi.org/10.1145/3511808.3557425. doi:10.1145/3511808.3557425.
[10] F. Betello, F. Siciliano, P. Mishra, F. Silvestri, Finite rank-biased overlap (FRBO): A new measure for stability in sequential recommender systems, in: Proceedings of the 14th Italian Information Retrieval Workshop, volume 3802, 2024, pp. 78–81.
[11] V. Guarrasi, F. Siciliano, F. Silvestri, RobustRecSys @ RecSys2024: Design, evaluation and deployment of robust recommender systems, in: Proceedings of the 18th ACM Conference on Recommender Systems (RecSys '24), Association for Computing Machinery, New York, NY, USA, 2024, pp. 1265–1269. URL: https://doi.org/10.1145/3640457.3687106. doi:10.1145/3640457.3687106.
[12] F. Betello, F. Siciliano, P. Mishra, F. Silvestri, Investigating the robustness of sequential recommender systems against training data perturbations, in: Advances in Information Retrieval: 46th European Conference on Information Retrieval (ECIR 2024), Springer, 2024, pp. 205–220. URL: https://doi.org/10.1007/978-3-031-28241-6_14. doi:10.1007/978-3-031-28241-6_14.
[13] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, arXiv preprint arXiv:1808.09781 (2018). URL: https://arxiv.org/abs/1808.09781.
[14] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (2015) 1–19. URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.