<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Comparing Recommendation Losses under Negative Sampling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giulia Di Teodoro</string-name>
          <email>giulia.di.teodoro@ing.unipi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Siciliano</string-name>
          <email>siciliano@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Tonellotto</string-name>
          <email>nicola.tonellotto@unipi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Silvestri</string-name>
          <email>fsilvestri@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Engineering Department, University of Pisa</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Loss functions, such as categorical cross-entropy (CCE), binary cross-entropy (BCE), and Bayesian personalized ranking (BPR), play a central role in training modern recommender systems. Although evaluations are often based on ranking metrics, such as Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR), a direct understanding of how these losses relate to target metrics remains incomplete. Furthermore, full-item training is computationally prohibitive, which has led to the widespread use of negative sampling. In this extended abstract, we (i) derive theoretical equivalences and bounds relating these loss functions under negative sampling; (ii) prove that BPR and CCE become identical under a single negative sample; and (iii) show that BCE provides the tightest bound on NDCG and MRR when negative sampling is used. We complement our theoretical findings with empirical results on five datasets and four neural architectures, which consistently validate the theory.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>Loss Functions</kwd>
        <kwd>Negative Sampling</kwd>
        <kwd>Ranking Metrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Theoretical Analysis</title>
      <p>We consider a user $u$ and a set of items $\mathcal{I}$, where $\mathcal{P} \subset \mathcal{I}$ are the positive (interacted) items and $\mathcal{N}$ are the negative (non-interacted) items. Let $s(u, i)$ denote the model's score for item $i$.</p>
      <sec id="sec-2-1">
        <title>2.1. Loss Definitions under Sampling</title>
        <p>For a positive item $i^{+} \in \mathcal{P}$ and $m$ sampled negative items $i_{j}^{-} \in \mathcal{N}$, the losses are defined as follows.</p>
        <p>CCE - Categorical cross-entropy:
$$\mathcal{L}_{\mathrm{CCE}} = -\log \frac{e^{s(u, i^{+})}}{e^{s(u, i^{+})} + \sum_{j=1}^{m} e^{s(u, i_{j}^{-})}}$$</p>
        <p>BCE - Binary cross-entropy:
$$\mathcal{L}_{\mathrm{BCE}} = -\log \sigma\big(s(u, i^{+})\big) - \sum_{j=1}^{m} \log\Big(1 - \sigma\big(s(u, i_{j}^{-})\big)\Big)$$</p>
        <p>BPR - Bayesian personalized ranking:
$$\mathcal{L}_{\mathrm{BPR}} = -\sum_{j=1}^{m} \log \sigma\Big(s(u, i^{+}) - s(u, i_{j}^{-})\Big)$$</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Ranking Metrics</title>
        <p>The Normalized Discounted Cumulative Gain (NDCG) is a widely used recommendation metric that accounts for the graded relevance of items depending on their position in the ranked list. If there is only one relevant item and $r^{+}$ is its rank position, it reduces to
$$\mathrm{NDCG}(r^{+}) = \frac{1}{\log_{2}(1 + r^{+})}$$</p>
        <p>Another key ranking measure is the Mean Reciprocal Rank (MRR), which computes the inverse of the rank position $r^{+}$ of the first relevant item in the recommendations:
$$\mathrm{MRR}(r^{+}) = \frac{1}{r^{+}}$$</p>
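        <p>In this single-relevant-item case both metrics are simple functions of the rank $r^{+}$; a small sketch (our own naming, not from the paper):</p>
        <preformat>
import math

def ndcg_single(rank):
    # NDCG(r+) = 1 / log2(1 + r+) with exactly one relevant item
    return 1.0 / math.log2(1 + rank)

def mrr_single(rank):
    # MRR(r+) = 1 / r+, reciprocal rank of the first relevant item
    return 1.0 / rank

for r in (1, 2, 5, 10):
    print(r, round(ndcg_single(r), 3), round(mrr_single(r), 3))
        </preformat>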
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Equivalence of BPR and CCE</title>
        <p>Under a single negative sample ($m = 1$), we prove that $\mathcal{L}_{\mathrm{BPR}}$ is equivalent to $\mathcal{L}_{\mathrm{CCE}}$.</p>
        <p>Proposition 1. $\mathcal{L}_{\mathrm{BPR}} = \mathcal{L}_{\mathrm{CCE}}$ if one negative item ($m = 1$) is sampled for each user.</p>
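        <p>The identity follows because $-\log \sigma\big(s^{+} - s^{-}\big) = \log\big(1 + e^{-(s^{+} - s^{-})}\big) = -s^{+} + \log\big(e^{s^{+}} + e^{s^{-}}\big)$, which is exactly the two-class softmax cross-entropy. A quick numerical check (a sketch in our own notation):</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    s_pos, s_neg = rng.normal(size=2)
    cce = -s_pos + np.log(np.exp(s_pos) + np.exp(s_neg))   # CCE with m = 1
    bpr = np.log1p(np.exp(-(s_pos - s_neg)))               # -log sigma(s+ - s-)
    assert np.isclose(cce, bpr)   # identical, as Proposition 1 states
print("BPR and CCE coincide for m = 1")
        </preformat>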
        <p>This highlights that, when sampling only one negative per positive, optimizing CCE or BPR leads to the same parameter updates.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Equivalence of Global Minima</title>
        <p>We now present a result that establishes the equivalence of the global minima of the three loss functions when a single negative is sampled and item scores are bounded.</p>
        <p>Proposition 2. If $s^{+}, s^{-} \in [-s_{M}, s_{M}]$ for some $s_{M} > 0$, then:
$$\arg\min_{s^{+}} \mathcal{L}_{\mathrm{CCE}} = \arg\min_{s^{+}} \mathcal{L}_{\mathrm{BCE}} = \arg\min_{s^{+}} \mathcal{L}_{\mathrm{BPR}} = s_{M}, \qquad \arg\min_{s^{-}} \mathcal{L}_{\mathrm{CCE}} = \arg\min_{s^{-}} \mathcal{L}_{\mathrm{BCE}} = \arg\min_{s^{-}} \mathcal{L}_{\mathrm{BPR}} = -s_{M}$$</p>
        <p>This proposition implies that, under bounded scores and single negative sampling, BPR, BCE, and CCE converge to the same optimal solution. Practically, it means that the choice of loss function does not affect the ideal parameter configuration.</p>
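        <p>A brute-force grid search illustrates the proposition; the bound $s_{M} = 3$ and the grid resolution below are our own assumptions for the sketch:</p>
        <preformat>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

s_max = 3.0                                  # assumed score bound s_M
grid = np.linspace(-s_max, s_max, 61)        # candidate scores in [-s_M, s_M]
losses = {
    "CCE": lambda p, n: -p + np.log(np.exp(p) + np.exp(n)),
    "BCE": lambda p, n: -np.log(sigmoid(p)) - np.log(1.0 - sigmoid(n)),
    "BPR": lambda p, n: -np.log(sigmoid(p - n)),
}
for name, loss in losses.items():
    values = np.array([[loss(p, n) for n in grid] for p in grid])
    i, j = np.unravel_index(values.argmin(), values.shape)
    # every loss reports the same minimizer: s+ = s_M, s- = -s_M
    print(name, "minimized at s+ =", grid[i], "s- =", grid[j])
        </preformat>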
        <p>However, in deep neural networks, these extreme score values are rarely reached due to regularization, early stopping, and model inductive biases, which prevent overfitting and favour generalization [18, 19]. Hence, while useful, this result has limited applicability to real-world RS training scenarios.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Bounding Ranking Metrics</title>
        <p>We now turn to the comparison of ranking losses from the perspective of their ability to upper bound ranking metrics, particularly $-\log(\mathrm{NDCG})$, under uniform negative sampling.</p>
        <p>Theorem 1. When uniformly sampling $m$ negative items, in the worst-case scenario, and $s^{+} \ge 0$:
$$\mathbb{P}\big(-\log \mathrm{NDCG}(r^{+}) \le \mathcal{L}_{\mathrm{BCE}}\big) \ge \mathbb{P}\big(-\log \mathrm{NDCG}(r^{+}) \le \mathcal{L}_{\mathrm{BPR}}\big) \ge \mathbb{P}\big(-\log \mathrm{NDCG}(r^{+}) \le \mathcal{L}_{\mathrm{CCE}}\big)$$</p>
        <p>This result shows that BCE offers the tightest bound on NDCG among the three losses, followed by BPR and then CCE. While the exact behaviour depends on the rank $r^{+}$ of the positive item and the number of sampled negatives $m$, BCE consistently exhibits more favourable properties, especially when item embeddings remain well-distributed, avoiding embedding collapse [20].</p>
        <p>That said, practical dynamics during training, such as changing item ranks, differences in optimization behaviour between losses, and embedding concentration due to popularity bias, can affect these bounds. Thus, while BCE is theoretically preferable, its advantage may vary in real-world scenarios.</p>
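        <p>The ordering in Theorem 1 can be probed with a small Monte Carlo sketch: draw item scores, compute the rank $r^{+}$ of the positive item in the full catalog, sample $m$ negatives uniformly, and estimate how often each loss upper-bounds $-\log \mathrm{NDCG}(r^{+})$. The Gaussian score distribution and all names below are our own assumptions, not the paper's setup:</p>
        <preformat>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
n_items, m, trials = 1000, 5, 10000
hits = {"CCE": 0, "BCE": 0, "BPR": 0}

for _ in range(trials):
    catalog = rng.normal(size=n_items)        # scores of the non-interacted items
    s_pos = abs(rng.normal())                 # positive score, s+ >= 0
    rank = 1 + np.sum(catalog > s_pos)        # rank r+ of the positive item
    target = np.log(np.log2(1 + rank))        # -log NDCG(r+)
    s_neg = rng.choice(catalog, size=m, replace=False)   # uniform negatives
    losses = {
        "CCE": -s_pos + np.log(np.exp(s_pos) + np.sum(np.exp(s_neg))),
        "BCE": -np.log(sigmoid(s_pos)) - np.sum(np.log(1.0 - sigmoid(s_neg))),
        "BPR": -np.sum(np.log(sigmoid(s_pos - s_neg))),
    }
    for name, value in losses.items():
        hits[name] += int(value >= target)

for name, h in hits.items():                  # BCE >= BPR >= CCE, as in Theorem 1
    print(name, h / trials)
        </preformat>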
      <p>Additional theorems, full proofs, and the extension to MRR can be found in the original work [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Empirical Evaluation</title>
      <p>We validate our theoretical insights on five benchmarks (MovieLens-1M [21], Amazon-Beauty [22], Amazon-Books [22], Yelp [23], and Foursquare NYC [24]) and four architectures (matrix factorization [25], Self-Attentive Sequential Recommendation (SASRec) [26], GRU4Rec [27], and LightGCN [28]). For each setting, we vary $m \in \{1, 5, 10, 20\}$ negatives per positive and measure NDCG@10 and MRR [29].</p>
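      <p>For reference, a minimal sketch of the uniform negative sampling protocol (function and variable names are ours, not the original evaluation code):</p>
      <preformat>
import numpy as np

def sample_negatives(n_items, positives, m, rng):
    """Uniformly draw m item ids the user has not interacted with."""
    negatives = []
    while m > len(negatives):
        candidate = int(rng.integers(n_items))
        if candidate not in positives:
            negatives.append(candidate)
    return np.array(negatives)

rng = np.random.default_rng(0)
user_positives = {3, 17, 42}              # toy interaction set
for m in (1, 5, 10, 20):                  # the settings varied in this section
    print(m, sample_negatives(1000, user_positives, m, rng))
      </preformat>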
      <sec id="sec-3-1">
        <title>3.1. Effect of Negative Sampling</title>
        <p>Figure 1. NDCG@10 versus training epoch (log scale) for BCE with varying numbers of negatives: (a) BCE – GRU4Rec, (b) BCE – SASRec, (c) BCE – GRU4Rec (Foursquare).</p>
        <p>We analyze how varying the number of negative items affects training on ML-1M using BCE. As shown in Fig. 1, fewer negatives yield faster improvements in early epochs, while a larger number (e.g., 100) leads to slower starts but better final performance. This reflects a trade-off: fewer negatives ease early learning, but more negatives improve generalization by providing harder contrasts. For BPR and CCE, we observe similar trends with slightly more stable early-phase training (see complete results in the original paper).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Loss Comparison: 1 vs 100 Negatives</title>
        <p>Figures 2 and 3. NDCG@10 versus training epoch (log scale), comparing the three losses: (a) ML-1M (1 negative), (b) ML-1M (100 negatives), (c) Foursquare (100 negatives).</p>
        <p>Figs. 2 and 3 compare loss functions using 1 and 100 negative samples. With a single negative, BPR and CCE perform identically on SASRec, as predicted by theory. BCE shows superior final performance, confirming its tighter bound on ranking metrics. On GRU4Rec, differences between losses are smaller.</p>
        <p>When using 100 negatives, CCE generally performs better than BPR early in training, while BCE starts slower but steadily improves, surpassing both losses in later epochs. On Foursquare (Figs. 2c and 3c), BCE again starts behind but shows strong late-phase gains. However, given BCE's slower convergence, CCE often remains the most stable choice in early-to-mid training.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>We presented a unified theoretical framework that (i) links popular recommendation losses under negative sampling, (ii) uncovers an equivalence between BPR and CCE for a single negative, and (iii) establishes BCE as the preferred surrogate for ranking metrics. Future directions include extending the analysis to dynamic sampling schemes and to gradient descent dynamics.</p>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments</title>
      <p>This work was partially supported by projects FAIR (PE0000013) and SERICS (PE00000014), under the
MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU, and
project NEREO (Neural Reasoning over Open Data), funded by the Italian Ministry of Education and
Research (PRIN) Grant no. 2022AEFHAZ.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 and DeepL for grammar and spelling checking. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Di Teodoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>A theoretical analysis of recommendation loss functions under negative sampling</article-title>
          ,
          <source>2025 International Joint Conference on Neural Networks (IJCNN)</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Freudenthaler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gantner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          ,
          <article-title>BPR: Bayesian personalized ranking from implicit feedback</article-title>
          ,
          <source>in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence</source>
          , UAI '09, AUAI Press, Arlington, Virginia, USA,
          <year>2009</year>
          , p.
          <fpage>452</fpage>
          -
          <lpage>461</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <article-title>gSASRec: Reducing overconfidence in sequential recommendation trained with negative sampling</article-title>
          ,
          <source>in: Proceedings of the 17th ACM Conference on Recommender Systems</source>
          , RecSys '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>116</fpage>
          -
          <lpage>128</lpage>
          . URL: https://doi.org/10.1145/3604915.3608783. doi:10.1145/3604915.3608783.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Understanding the role of cross-entropy loss in fairly evaluating large language model-based recommendation</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.06216.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <article-title>On the effectiveness of sampled softmax loss for item recommendation</article-title>
          ,
          <source>ACM Trans. Inf. Syst.</source>
          <volume>42</volume>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.1145/3637061. doi:10.1145/3637061.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>An analysis of the softmax cross entropy loss for learning-to-rank with binary relevance</article-title>
          ,
          <source>in: Proceedings of the 2019 ACM SIGIR international conference on theory of information retrieval</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Learning-efficient yet generalizable collaborative filtering for item recommendation</article-title>
          ,
          <source>in: Forty-first International Conference on Machine Learning (ICML)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>PSL: Rethinking and improving softmax loss from pairwise perspective for recommendation</article-title>
          ,
          <source>in: The Thirty-eighth Annual Conference on Neural Information Processing Systems</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Integrating item relevance in training loss for sequential recommender systems</article-title>
          ,
          <source>in: Proceedings of the 17th ACM Conference on Recommender Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1114</fpage>
          -
          <lpage>1119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lagziel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gamzu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tolomei</surname>
          </string-name>
          ,
          <article-title>Robust training of sequential recommender systems with missing input data</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          ,
          <article-title>Cumulated gain-based evaluation of ir techniques</article-title>
          ,
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>20</volume>
          (
          <year>2002</year>
          )
          <fpage>422</fpage>
          -
          <lpage>446</lpage>
          . URL: https://doi.org/10.1145/582415.582418. doi:10.1145/582415.582418.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <article-title>Wsabie: scaling up to large vocabulary image annotation</article-title>
          ,
          <source>in: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence -</source>
          Volume Three,
          <source>IJCAI'11</source>
          , AAAI Press,
          <year>2011</year>
          , p.
          <fpage>2764</fpage>
          -
          <lpage>2770</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C. J. C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <article-title>From RankNet to LambdaRank to LambdaMART: An Overview</article-title>
          ,
          <source>Technical Report, Microsoft Research</source>
          ,
          <year>2010</year>
          . URL: http://research.microsoft.com/en-us/um/people/cburges/tech_reports/MSR-TR-2010-82.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>LambdaFM: Learning optimal ranking with factorization machines using lambda surrogates</article-title>
          ,
          <source>in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management</source>
          , CIKM '16, Association for Computing Machinery, New York, NY, USA,
          <year>2016</year>
          , p.
          <fpage>227</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>