=Paper= {{Paper |id=Vol-3924/short7 |storemode=property |title=Robust Solutions for Ranking Variability in Recommender Systems |pdfUrl=https://ceur-ws.org/Vol-3924/short7.pdf |volume=Vol-3924 |authors=Bonifacio Marco Francomano,Federico Siciliano,Fabrizio Silvestri |dblpUrl=https://dblp.org/rec/conf/robustrecsys/FrancomanoSS24 }} ==Robust Solutions for Ranking Variability in Recommender Systems== https://ceur-ws.org/Vol-3924/short7.pdf
                         Robust Solutions for Ranking Variability in Recommender
                         Systems⋆
                         Bonifacio Marco Francomano1 , Federico Siciliano1 and Fabrizio Silvestri1
                         1
                             Sapienza University of Rome, Rome, Italy


                                           Abstract
In the field of recommender systems, an important issue within the current state-of-the-art is the inconsistency in item rankings produced by models initialized with different weight seeds. Although these models achieve convergence and obtain similar average performance metrics, their item rankings differ significantly. This phenomenon is quantitatively demonstrated using metrics such as Rank List Sensitivity (RLS) and Normalized Discounted Cumulative Gain (NDCG) across different model pairs. In this paper, we reaffirm the existence of this problem and provide new insights by analysing models with common item embeddings but different network initialization, and models with different item embeddings but common network initialization, to identify which network components most influence ranking variability. To address the general issue, we propose an ensemble approach that averages the outputs of multiple models. Our ensemble maintains the NDCG of the original model while significantly improving ranking stability: the RLS FRBO@10 value shows an approximate increase of 30.82%.

                                           Keywords
                                           Recommender Systems, Evaluation of Recommender Systems, Model Stability



1. Introduction

In recent years, neural sequential recommender systems have gained importance due to their ability to model user behavior over time, providing more accurate and personalized recommendations [1, 2]. Unlike traditional recommender systems that consider user preferences in a static context, neural sequential recommender systems capture the temporal dynamics of user interactions [3]. This capability is central in domains such as e-commerce [4], streaming services [5], and social media [6]. By analyzing the sequence of items a user interacts with, these systems can predict future preferences [7].

   Despite the advancements in neural sequential recommender systems, a significant issue persists: the variability in item rankings generated by models initialized with different weight seeds [8]. This rank variability is problematic as it affects the consistency and reliability of recommendations [9, 10]. Consider the following example of ranking variability between two different initializations of a model (Tables 1 and 2):

Table 1
Ranking Initialization 1

   Rank   Item ID
     1    Item A
     2    Item B
     3    Item C
     4    Item D
     5    Item E

Table 2
Ranking Initialization 2

   Rank   Item ID
     1    Item A
     2    Item C
     3    Item B
     4    Item E
     5    Item D

   In both initializations, when the models reach convergence, Item A is consistently ranked as the top item, which is expected as it is the correct positive item that the model should prioritize. However, there is significant variability in the ranking of the other items. For instance, Item B is ranked 2nd in Initialization 1 but drops to 3rd in Initialization 2. Similarly, Item C rises from 3rd in Initialization 1 to 2nd in Initialization 2. This variability can occur because the loss function used in training typically focuses on ensuring the correct positive item is ranked highest, but does not enforce a specific order for the remaining negative items.

   Even when models converge and predict the same top-ranked item for a given user sequence, the subsequent items in the ranking often differ. This inconsistency can affect tasks that require multiple item predictions simultaneously, generate relevance-ordered rankings, and impact the explainability of the model's recommendations. In particular, we refer to related works that address the sensitivity and robustness of recommender systems [11]. Oh et al. [9] have shown that recommender systems are highly sensitive to perturbations in the training data, where even minor changes can significantly alter the recommendations for users. They introduce Rank List Sensitivity (RLS) as a measure to assess this instability and propose the CASPER method, which identifies minimal perturbations that induce significant instability. Their experiments reveal that such perturbations, even if minimal, can drastically impact the recommendation lists, particularly for users who receive low-quality recommendations.

RobustRecSys: Design, Evaluation, and Deployment of Robust Recommender Systems Workshop @ RecSys 2024, 18 October, 2024, Bari, Italy.
⋆ This work was partially supported by projects FAIR (PE0000013) and SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU. Supported also by the ERC Advanced Grant 788893 AMDROMA, EC H2020 RIA project "SoBigData++" (871042), and PNRR MUR project IR0000013-SoBigData.it. This work has also been supported by the NEREO (Neural Reasoning over Open Data) project funded by the Italian Ministry of Education and Research (PRIN), Grant no. 2022AEFHAZ.
francomano.1883955@studenti.uniroma1.it (B. M. Francomano); siciliano@diag.uniroma1.it (F. Siciliano); fsilvestri@diag.uniroma1.it (F. Silvestri)
ORCID: 0000-0003-1339-6983 (F. Siciliano); 0000-0001-7669-9055 (F. Silvestri)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

   Similarly, Betello et al. [12]
investigate the robustness of Sequential Recommender Systems (SRSs) in the face of training data perturbations. They identify limitations in existing robustness measures like Rank-Biased Overlap (RBO) and propose Finite Rank-Biased Overlap (FRBO), a more suitable metric for finite rankings. Their findings highlight that perturbations at the end of a sequence can severely degrade system performance, emphasizing the importance of the position of perturbations within the training data. To address this challenge, this paper investigates the root causes of ranking variability, focusing on the role of weight initialization. We investigate whether this variability is primarily due to the initialization of item embeddings or to the initialization of the whole network. Through a detailed analysis using metrics such as RLS and NDCG, we show how different weight initialization seeds lead to significant discrepancies in ranking results, highlighting the need for more robust approaches. As a solution, we propose the use of ensemble models, which combine the predictions of multiple models initialized with different seeds. By averaging the scores of these models, we can improve the stability of the rankings while maintaining or even improving the overall performance as measured by the NDCG.

   Our experiments show that ensemble methods effectively reduce ranking variability. Specifically, models with different initialization seeds obtain an average RLS-FRBO@10 of 0.542, while an ensemble of models achieves a score of 0.709. Furthermore, we show that the average NDCG@10 across models with different seeds is 0.129, while an ensemble approach improves it to 0.132. These results show that ensemble methods not only reduce variability but also maintain or even slightly enhance recommendation quality. Our contributions can be summarized as follows:

   • We identified the significant impact of weight initialization on the variability of item rankings in recommender systems.
   • We proposed the use of ensemble models to reduce this variability, demonstrating that shared embeddings can further improve ranking consistency.

2. Methodology

2.1. Sequential Recommendation Model

The experimental framework employs the SASRec model [13], a state-of-the-art sequential recommendation system that uses self-attention mechanisms. The model processes user interaction sequences in the following stages:

   • Embedding Layer: Items in the user interaction sequence are transformed into dense vector representations in a continuous vector space. Given an item i from the vocabulary of size N, the embedding layer maps it to a dense vector v_i ∈ R^d, where d is the embedding dimension:

        v_i = E x_i

     Here, E ∈ R^{d×N} is the embedding matrix, and x_i is a one-hot encoded vector representing item i.

   • Unidirectional Self-Attention Mechanism: The SASRec model uses a unidirectional self-attention mechanism, which focuses on identifying temporal dependencies in the sequence of user interactions, considering only the items that precede a given item in the sequence. The attention score α_ij for items i and j (with j < i) is computed as:

        α_ij = softmax( (q_i)^⊤ k_j / √d_k )

     where q_i = W_Q v_i, k_j = W_K v_j, W_Q and W_K are learned weight matrices, and d_k is the dimension of the key vectors. This mechanism ensures that the model only attends to past items, maintaining the sequential nature of the recommendation process.

   • Prediction and Ranking: The attention outputs are used to predict the next item in the sequence, generating a ranked list of recommendations based on the user's interaction history. The output of the self-attention mechanism yields a context-aware embedding z_i for each item:

        z_i = Σ_{j=1}^{i−1} α_ij v_j

     The next item î_{t+1} is predicted by applying a softmax function over the dot products between z_i and the embedding matrix E:

        î_{t+1} = argmax( softmax( z_i^⊤ E ) )

   For example, if the user interacted with items [i_1, i_2, ..., i_t], the model will generate predictions for the next likely item î_{t+1} based on the context-aware embeddings. Assume i_1 corresponds to "Book A", i_2 to "Book B", and so on. If the attention mechanism strongly associates "Book B" with "Book D" in the past, "Book D" might be ranked higher as the next recommendation î_{t+1}.

   For this study, the SASRec model configuration includes an embedding dimension of 50, a single self-attention head, and a dropout rate of 0.2, the latter being employed to reduce the risk of overfitting.

2.2. Dataset

The dataset employed for this study is the MovieLens 1M (ML-1M) dataset [14], which is a widely recognized benchmark for evaluating recommendation algorithms. The MovieLens 1M dataset contains 1 million ratings provided by approximately 6,000 users on 4,000 movies. Each user in the dataset has rated at least 20 movies, making it a dense dataset that is well-suited for assessing the performance of sequential recommendation models like SASRec.

2.3. Dataset Preparation

The experimental dataset is curated with careful consideration of the following preprocessing steps:

   • Rating Threshold: To ensure meaningful interaction data, only users who have rated at least 5 items, and items rated by at least 5 users, are retained.
   • Data Partitioning: A leave-one-out strategy is adopted for data splitting. In particular, for each user, the most recent interaction is used for testing, while the one before the last is reserved for validation. The remaining data constitutes the training set.
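The leave-one-out partitioning can be sketched in a few lines. This is an illustrative snippet under our own naming (the paper does not provide code); `interactions` is assumed to map each user to a chronologically ordered list of item IDs:

```python
def leave_one_out_split(interactions):
    """Leave-one-out split: for each user, the most recent interaction
    goes to the test set, the one before it to validation, and the
    rest to training (illustrative sketch; names are our own)."""
    train, val, test = {}, {}, {}
    for user, items in interactions.items():
        if len(items) < 3:              # too short to split; keep for training
            train[user] = list(items)
            continue
        train[user] = list(items[:-2])  # all but the last two interactions
        val[user] = items[-2]           # one-before-last interaction
        test[user] = items[-1]          # most recent interaction
    return train, val, test

# Example: one user with four chronologically ordered item IDs.
train, val, test = leave_one_out_split({"u1": [10, 20, 30, 40]})
print(train["u1"], val["u1"], test["u1"])   # [10, 20] 30 40
```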
   • Negative Sampling: Negative sampling is employed to balance the dataset. During training and validation, one negative sample per positive instance is generated, whereas, in the testing phase, all non-interacted items are considered as negative samples.

2.4. Implementation Details

The training pipeline is meticulously designed to optimize performance and efficiency:

   • Batch Processing: Training is conducted with a batch size of 128, ensuring the model can efficiently process substantial data per iteration.
   • Model Checkpointing: To capture the optimal model configuration, checkpoints are saved at intervals. The best model is determined based on the highest NDCG@10 score observed during validation.
   • Training Setup: The model undergoes training for up to 200 epochs, utilizing the Adam optimizer. The objective function is the Binary Cross-Entropy Loss, which is particularly suited for this task: it computes probabilities over two classes, the positive item and the negative ones, since the model has to predict whether an item is the next one in the sequence or not. Additionally, BCE Loss outputs probabilities that can be used directly to rank items by their likelihood of being the next interaction.

2.5. Metrics

The evaluation of model performance is based on several metrics, addressing both predictive accuracy and robustness.

   • Performance Metrics: The primary metric for assessing predictive performance is NDCG (Normalized Discounted Cumulative Gain), calculated at various cutoffs (5, 10, and 20). NDCG evaluates the ranking quality of the recommended items, giving higher importance to items ranked closer to the top of the list.
   • Robustness Metrics: To evaluate the robustness of the recommender system, we use the Rank List Sensitivity (RLS) metric:

        RLS = (1 / |X_test|) Σ_{X_k ∈ X_test} sim(R^{X_k}_M, R^{X_k}_M′)

     Here, X_test is the set of test items, R^{X_k}_M and R^{X_k}_M′ are the rank lists generated by the model M and a perturbed version M′ for item X_k. The function sim(A, B) measures the similarity between two rank lists A and B. The RLS metric gives an average similarity score over all test items. RLS measures how much a model's recommendations change in response to small perturbations in the training data.
     There are two versions of RLS: RLS-RBO (Rank-Biased Overlap) and RLS-JAC (Jaccard Similarity). RLS-RBO, based on the Rank-Biased Overlap, computes the similarity between two ranked lists, but it is tailored for infinite rankings, which can limit its applicability in finite settings.

        RBO(A, B) = (1 − p) Σ_{d=1}^{|I|} p^{d−1} |A[1:d] ∩ B[1:d]| / d

     The Rank-Biased Overlap (RBO) measures the similarity of orderings between two rank lists A and B. The parameter p (typically set to 0.9) controls the weighting, with higher weights given to the top ranks. |I| is the total number of items, and A[1:d] represents the top-d items in list A. The RBO score lies between 0 and 1, where higher values indicate more similarity between the rank lists. RLS-JAC, on the other hand, uses the Jaccard similarity coefficient, focusing on the overlap of items between two sets without considering their order:

        Jaccard(A, B) = |A ∩ B| / |A ∪ B|

     where A and B are the sets of items in the two ranked lists. Given the limitations of RLS-RBO, we utilize an enhanced version called Finite Rank-Biased Overlap (FRBO) [12], which is specifically designed for finite-length rankings:

        FRBO(X, Y)@k = ((1 − p) / (1 − p^k)) Σ_{d=1}^{k} p^{d−1} |X[1:d] ∩ Y[1:d]| / d

     Here, p is a parameter that controls the weight assigned to ranks, and X[1:d] and Y[1:d] represent the top-d items in the two ranked lists being compared. The RBO value is now normalized by its maximum possible value, ensuring that the metric reaches 1 when the two rankings are identical. FRBO thus addresses the shortcomings of RBO by correctly handling identical rankings, making it more suitable for practical evaluation scenarios.
     By comparing different versions of RLS, we aim to capture how robustly the models handle perturbations, ensuring a more reliable assessment of system stability. For brevity, we will refer to the Jaccard version of RLS as RLS_J@k, and the FRBO version of RLS as RLS_F@k in the subsequent sections and tables. The notation @k indicates a cutoff at the first k items of the lists.

3. Research Questions

We aim to answer the following questions:

   • RQ1: Does a variation of an (apparently) small factor influence the robustness of Neural Recommender Systems?
   • RQ2: Do ensembles of models improve the robustness of Neural RecSys?
   • RQ3: Can we improve performance through ensembling?

3.1. RQ1

Tables 3 and 4 reveal that models initialized with different seeds, which means entirely different weight initializations, obtain an average Jaccard@10 of 0.505 and an average FRBO@10 of 0.542, both of which are significantly below the ideal value of 1.

   When initializing only the item embeddings with the same seed, Tables 5 and 6 show a slight but not significant increase, reaching a value of 0.514 for the Jaccard@10, and 0.547 for the FRBO@10.
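As a concrete reference for the similarity measures used inside RLS, the following minimal sketch implements Jaccard@k and FRBO@k as defined above (our own illustrative code, not the authors'; p defaults to 0.9 as in the text):

```python
def jaccard_at_k(a, b, k):
    """Jaccard similarity between the top-k item sets of two rankings."""
    sa, sb = set(a[:k]), set(b[:k])
    return len(sa & sb) / len(sa | sb)

def frbo_at_k(x, y, k, p=0.9):
    """Finite Rank-Biased Overlap: RBO truncated at depth k and
    normalized so that identical rankings score exactly 1."""
    total = 0.0
    for d in range(1, k + 1):
        agreement = len(set(x[:d]) & set(y[:d])) / d  # |X[1:d] ∩ Y[1:d]| / d
        total += p ** (d - 1) * agreement
    return (1 - p) / (1 - p ** k) * total

# Identical rankings attain the ideal value of 1 (up to float rounding):
print(round(frbo_at_k([1, 2, 3], [1, 2, 3], k=3), 6))   # 1.0
print(jaccard_at_k([1, 2, 3, 4], [1, 2, 5, 6], k=3))    # 0.5
```

Note that the (1 − p)/(1 − p^k) factor is exactly the normalization that distinguishes FRBO from truncated RBO: it guarantees the metric reaches 1 for identical finite rankings.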
   Seed 1    Seed 2     RLS_J@5         RLS_J@10        RLS_J@20
     42        43        0.465           0.508           0.565
     42        44        0.464           0.507           0.563
     42        45        0.456           0.498           0.555
     43        44        0.464           0.507           0.563
     43        45        0.463           0.506           0.561
     44        45        0.461           0.504           0.560
      AVG ± SD       0.462 ± 0.001   0.505 ± 0.001   0.561 ± 0.001

Table 3
RLS_J computed between models initialized with different seeds, for different values of k (5, 10, and 20).

   Seed 1    Seed 2     RLS_F@5         RLS_F@10        RLS_F@20
     42        43        0.499           0.546           0.580
     42        44        0.497           0.544           0.579
     42        45        0.489           0.536           0.571
     43        44        0.497           0.544           0.578
     43        45        0.495           0.542           0.577
     44        45        0.493           0.541           0.575
      AVG ± SD       0.495 ± 0.001   0.542 ± 0.001   0.577 ± 0.001

Table 4
RLS_F computed between models initialized with different seeds, for different values of k (5, 10, and 20).

   Seeds 1   Seeds 2    RLS_J@5         RLS_J@10        RLS_J@20
    43_42     43_45       0.481           0.528           0.581
    44_42     44_43       0.469           0.517           0.571
    42_43     42_44       0.467           0.515           0.569
    42_44     42_45       0.467           0.516           0.569
    42_43     42_45       0.466           0.513           0.567
    43_44     43_45       0.465           0.513           0.567
    43_42     43_45       0.465           0.511           0.566
    44_43     44_45       0.462           0.510           0.565
    44_42     44_45       0.462           0.511           0.565
    45_43     45_42       0.462           0.511           0.565
    45_43     45_44       0.463           0.511           0.566
    45_42     45_44       0.464           0.512           0.566
      AVG ± SD       0.466 ± 0.001   0.514 ± 0.001   0.568 ± 0.001

Table 5
RLS_J computed between models initialized with the same embedding seeds, for different values of k (5, 10, and 20). The notation x_y represents a model where the embedding layer is initialized with seed x, while the rest of the model is initialized with seed y.

   Seeds 1   Seeds 2    RLS_J@5         RLS_J@10        RLS_J@20
    43_44     42_44       0.461           0.507           0.564
    44_42     43_42       0.467           0.515           0.569
    44_43     42_43       0.465           0.513           0.567
    42_43     45_43       0.485           0.533           0.586
    42_44     45_44       0.481           0.529           0.582
    43_45     44_45       0.494           0.542           0.594
    44_45     42_45       0.503           0.550           0.602
    43_45     42_45       0.513           0.561           0.612
    43_42     45_42       0.510           0.557           0.609
    44_42     45_42       0.516           0.564           0.614
    44_43     45_43       0.511           0.558           0.610
    43_44     45_44       0.518           0.565           0.616
      AVG ± SD       0.494 ± 0.006   0.541 ± 0.006   0.594 ± 0.005

Table 7
RLS_J computed between models initialized with the same rest-of-the-network seed, for different values of k (5, 10, and 20). The notation x_y represents a model where the embedding layer is initialized with seed x, while the rest of the model is initialized with seed y.

   Seeds 1   Seeds 2    RLS_F@5         RLS_F@10        RLS_F@20
    43_44     42_44       0.501           0.546           0.581
    44_42     43_42       0.505           0.551           0.585
    44_43     42_43       0.502           0.549           0.583
    42_43     45_43       0.522           0.568           0.602
    42_44     45_44       0.517           0.564           0.598
    43_45     44_45       0.529           0.575           0.609
    44_45     42_45       0.537           0.583           0.617
    43_45     42_45       0.546           0.593           0.626
    43_42     45_42       0.543           0.590           0.623
    44_42     45_42       0.549           0.595           0.628
    44_43     45_43       0.544           0.591           0.624
    43_44     45_44       0.551           0.597           0.630
      AVG ± SD       0.529 ± 0.005   0.575 ± 0.005   0.609 ± 0.005

Table 8
RLS_F computed between models initialized with the same rest-of-the-network seed, for different values of k (5, 10, and 20). The notation x_y represents a model where the embedding layer is initialized with seed x, while the rest of the model is initialized with seed y.

   These results suggest that embedding initialization has a limited impact on the final model performance and resulting rankings. We therefore aim to investigate the relationship between two final trained embedding spaces.

   Seeds 1   Seeds 2    RLS_F@5         RLS_F@10        RLS_F@20
   43_44     43_42         0.514           0.562           0.596       3.1.1. Investigation of Embedding Spaces
   44_42     44_43         0.502           0.551           0.586
   42_43     42_44         0.501           0.549           0.584       To explore the relationship between embeddings from dif-
   42_44     42_45         0.501           0.549           0.584       ferent models, we applied a linear transformation to map
   42_43     42_45         0.498           0.546           0.582
   43_44     43_45         0.498           0.547           0.582       the embeddings from one model to another, followed by a
   43_42     43_45         0.496           0.545           0.580       visualization using Principal Component Analysis (PCA).
   45_43     45_44         0.494           0.543           0.578       This analysis aims to provide insights into how well the
   44_42     44_45         0.495           0.543           0.579
   44_43     44_45         0.496           0.545           0.580
                                                                       embeddings from different models align after the transfor-
                                                                       mation.
      AVG ± SD         0.499 ± 0.002   0.547 ± 0.001   0.582 ± 0.001
                                                                          The experimental procedure involved the following steps:
Table 6
𝑅𝐿𝑆𝐹 computed between models initialized with the same em-
                                                                            1. Embedding matrices: 𝑋 and 𝑌 are the embedding
bedding seeds, for different values of 𝑘 (5, 10, and 20). The no-              matrices of the first and second models, respectively,
tation x_y represents a model where the embedding layer is                     each with a shape of 𝑁 × 𝑑.
initialized with seed x, while the rest of the model is initialized         2. Fitting a Linear Regression Model: We use a
with seed y.                                                                   linear regression model to estimate a transformation
                                                                               matrix 𝑊 that maps the embeddings from Model 1
                                                                               to those of Model 2:
   Instead, initializing the rest of the network with the same
                                                                                                       𝑌ˆ = 𝑋 · 𝑊
seed but using different embedding seeds, leads to a signifi-
cant improvement: in Tables 7 and 8 we observe an value                           where 𝑌ˆ is the transformed embeddings obtained
of 0.541 for the Jaccard@10 and 0.575 for the FRBO@10.                            from Model 1 that should match those of Model 2.
    3. Dimensionality reduction with PCA: to visualize the embeddings, we reduce their dimensionality using Principal Component Analysis (PCA). Both the transformed embeddings 𝑌ˆ and the original embeddings 𝑌 are projected into a 2D space by retaining the first two principal components:

                      𝑌ˆPCA = PCA(𝑌ˆ)        𝑌PCA = PCA(𝑌)

    4. Visualization: we use scatter plots to visualize the relationship between the transformed and the original embeddings, for each PCA component. A red dashed line representing the bisector (𝑦 = 𝑥) is included in both plots to visually assess the alignment of the components.

Figure 1: Plot of the PCA components of the original embeddings of Model 2 and the transformed embeddings.

   Fig. 1 reveals a strong alignment between the transformed embeddings and the original embeddings of Model 2: the points are closely distributed along the bisector. This suggests that the linear transformation is highly effective in mapping the embeddings of Model 1 into the embedding space of Model 2. This finding implies that the embedding layers converge to similar spaces, which differ from one another only by a linear transformation. Assuming the transformation matrix 𝑊 has full rank, it represents an endomorphism and is therefore invertible. This implies that the attention mechanism, which applies linear transformations as described in Section 2.1, can effectively learn to align the embeddings, regardless of the specific embedding space to which the layers converge.
   As a result, we argue that the position to which an embedding converges is almost independent of the initialization seed: even when the rest of the network is initialized with different seeds, the embeddings tend to converge to similar positions in the embedding space. Consequently, changing the seed of the rest of the network has a more significant impact, because the representations of the items in the embedding space remain quite similar regardless of the seed, while the rest of the network's components do not. This highlights that the embedding space is relatively stable across different seeds, whereas the rest of the network is more sensitive to the initial seed.

3.2. RQ2

The next step of the analysis is to combine the scores of different models, in order to check whether the RLS between combined models changes significantly.

  Ensemble 1    Ensemble 2    𝑅𝐿𝑆𝐹 @5    𝑅𝐿𝑆𝐹 @10    𝑅𝐿𝑆𝐹 @20
    42_43          43_45        0.663       0.705        0.733
    42_43          44_45        0.631       0.673        0.702
    42_43          43_44        0.646       0.688        0.717
    42_43          42_44        0.657       0.698        0.726
    42_43            44         0.653       0.695        0.724

     AVG ± SD       0.650 ± 0.005    0.692 ± 0.005    0.720 ± 0.005

Table 9
RLS FRBO improvement with different ensembles. The notation x_y represents an ensemble formed by averaging the scores of models initialized with seeds x and y. The last row shows the RLS computed between an ensemble and a single model.

   It is evident from Table 9 that using ensembles significantly improves the RLS compared to individual models, indicating better stability.

3.3. RQ3

To investigate the performance, we present Table 10, showing the NDCG scores of the individual models computed @5, @10 and @20.

    Seed        NDCG@5         NDCG@10        NDCG@20
     42          0.106           0.132          0.157
     43          0.105           0.129          0.156
     44          0.101           0.127          0.152
     45          0.106           0.130          0.156

  AVG ± SD   0.105 ± 0.002   0.130 ± 0.002   0.155 ± 0.002

Table 10
NDCG scores for different seeds.

   In contrast, Table 11 presents the NDCG scores for ensembles of different sizes, with values averaged across all possible model combinations and the final row reporting the scores of the single ensemble formed by all four trained models. The results demonstrate a slight but consistent improvement in performance with larger ensembles.

  #Models      NDCG@5          NDCG@10         NDCG@20
     1       0.105 ± 0.002   0.130 ± 0.002   0.155 ± 0.002
     2       0.108 ± 0.001   0.133 ± 0.001   0.160 ± 0.001
     3       0.110 ± 0.001   0.135 ± 0.001   0.163 ± 0.001
     4       0.110           0.137           0.164

Table 11
NDCG values for ensembles of different sizes. The values are averages computed over all possible model combinations for each ensemble size, except for the last row, which shows the NDCG scores of the single ensemble formed by all four trained models.

4. Considerations on Computational Cost

In the context of enhancing the robustness of neural recommender systems, the use of an ensemble of models provides significant benefits. However, it also introduces a computational cost that scales linearly with the number of models in the ensemble. Specifically, the cost of training an ensemble is approximately the cost of training a single model multiplied by the number of models:

               Cost_ensemble = 𝑁_models × Cost_single_model
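The ensemble used above simply averages the per-item scores of the individual models before ranking. The following is a minimal sketch of that averaging step together with a top-𝑘 Jaccard overlap check (the scores, function names, and helper are our illustrative assumptions, not the authors' code):

```python
# Illustrative sketch of a score-averaging ensemble (synthetic scores;
# in the paper's setting each score list would come from a model
# trained with a different initialization seed).

def ensemble_scores(score_lists):
    """Average per-item scores across models (one score list per model)."""
    n_models = len(score_lists)
    n_items = len(score_lists[0])
    return [sum(scores[i] for scores in score_lists) / n_models
            for i in range(n_items)]

def rank_items(scores):
    """Return item indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

def jaccard_at_k(rank_a, rank_b, k):
    """Top-k Jaccard overlap between two rankings (one ingredient of
    the RLS-style stability metrics reported in the tables above)."""
    top_a, top_b = set(rank_a[:k]), set(rank_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)

# Two hypothetical models scoring the same five items.
model_1 = [0.9, 0.2, 0.7, 0.1, 0.5]
model_2 = [0.6, 0.3, 0.9, 0.2, 0.4]

ensemble_ranking = rank_items(ensemble_scores([model_1, model_2]))
print(ensemble_ranking)  # → [2, 0, 4, 1, 3]
```

Note that the averaging itself is negligible at inference time; the dominant cost of serving such an ensemble is running each underlying model once per request, which is exactly the linear scaling discussed in this section.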
   The same applies to the inference stage. As shown in Tables 9 and 11, the performance of the system remains relatively stable regardless of the number of models used, while robustness improves notably when moving from a single model to an ensemble of two models. Given this, a reasonable compromise is to use an ensemble of two models. This choice results in a computational cost of approximately:

               Cost_training = 2 × Cost_single_model × Iterations

for backpropagation during training, and:

               Cost_inference = 2 × Cost_single_model

for inference during deployment. This allows for a significant improvement in robustness at roughly double the computational cost.

5. Conclusions

In conclusion, this work addresses the critical issue of ranking variability in recommender systems, which arises from different model initialization seeds. Our findings suggest that ensemble methods, particularly those incorporating shared embeddings, offer a promising solution to mitigate this variability. By reducing ranking fluctuations, these methods enhance the reliability and consistency of recommendations. However, while our approach significantly reduces variability, some residual variability remains, indicating the need for further research into additional techniques or refinements to achieve even greater consistency in recommendations.

References

 [1] T. F. Boka, Z. Niu, R. B. Neupane, A survey of sequential recommendation systems: Techniques, evaluation, and future directions, School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China (2024).
 [2] F. Betello, A. Purificato, F. Siciliano, G. Trappolini, A. Bacciu, N. Tonellotto, F. Silvestri, A reproducible analysis of sequential recommender systems, IEEE Access 13 (2025) 5762–5772. doi:10.1109/ACCESS.2024.3522049.
 [3] M. Quadrana, P. Cremonesi, D. Jannach, Sequence-aware recommender systems, ACM Computing Surveys (CSUR) 51 (2018) 66:1–66:36. doi:10.1145/3190616.
 [4] U. Singer, H. Roitman, Y. Eshel, A. Nus, I. Guy, O. Levi, I. Hasson, E. Kiperwasser, Sequential modeling with multiple attributes for watchlist recommendation in e-commerce, in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (WSDM '22), Association for Computing Machinery, New York, NY, USA, 2022, pp. 937–946. URL: https://doi.org/10.1145/3488560.3498453. doi:10.1145/3488560.3498453.
 [5] C. Hansen, C. Hansen, L. Maystre, R. Mehrotra, B. Brost, F. Tomasi, M. Lalmas, Contextual and sequential user embeddings for large-scale music recommendation, in: Proceedings of the 14th ACM Conference on Recommender Systems (RecSys '20), Association for Computing Machinery, New York, NY, USA, 2020, pp. 53–62. URL: https://doi.org/10.1145/3383313.3412248. doi:10.1145/3383313.3412248.
 [6] Q. Tan, J. Zhang, J. Yao, N. Liu, J. Zhou, H. Yang, X. Hu, Sparse-interest network for sequential recommendation, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM '21), Association for Computing Machinery, New York, NY, USA, 2021, pp. 598–606. URL: https://doi.org/10.1145/3437963.3441811. doi:10.1145/3437963.3441811.
 [7] A. Sbandi, F. Siciliano, F. Silvestri, Mitigating extreme cold start in graph-based recsys through re-ranking, in: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24), Association for Computing Machinery, New York, NY, USA, 2024, pp. 4844–4851. URL: https://doi.org/10.1145/3627673.3680069. doi:10.1145/3627673.3680069.
 [8] E. D'Amico, G. Gabbolini, C. Bernardis, P. Cremonesi, Analyzing and improving stability of matrix factorization for recommender systems, Journal of Intelligent Information Systems (2022). URL: https://doi.org/10.1007/s10844-021-00658-9. doi:10.1007/s10844-021-00658-9.
 [9] S. Oh, B. Ustun, J. McAuley, S. Kumar, Rank list sensitivity of recommender systems to interaction perturbations, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM '22), ACM, Atlanta, GA, USA, 2022, pp. 1584–1594. URL: https://doi.org/10.1145/3511808.3557425. doi:10.1145/3511808.3557425.
[10] F. Betello, F. Siciliano, P. Mishra, F. Silvestri, Finite rank-biased overlap (FRBO): A new measure for stability in sequential recommender systems, in: Proceedings of the 14th Italian Information Retrieval Workshop, volume 3802, 2024, pp. 78–81.
[11] V. Guarrasi, F. Siciliano, F. Silvestri, RobustRecSys @ RecSys2024: Design, evaluation and deployment of robust recommender systems, in: Proceedings of the 18th ACM Conference on Recommender Systems (RecSys '24), Association for Computing Machinery, New York, NY, USA, 2024, pp. 1265–1269. URL: https://doi.org/10.1145/3640457.3687106. doi:10.1145/3640457.3687106.
[12] F. Betello, F. Siciliano, P. Mishra, F. Silvestri, Investigating the robustness of sequential recommender systems against training data perturbations, in: Advances in Information Retrieval: 46th European Conference on Information Retrieval (ECIR 2024), Springer, 2024, pp. 205–220. URL: https://doi.org/10.1007/978-3-031-28241-6_14. doi:10.1007/978-3-031-28241-6_14.
[13] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, arXiv preprint arXiv:1808.09781 (2018). URL: https://arxiv.org/abs/1808.09781.
[14] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (2015) 1–19. URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.