=Paper=
{{Paper
|id=Vol-2994/paper51
|storemode=property
|title=Unbiasing Collaborative Filtering for Popularity-Aware Recommendation (Discussion Paper)
|pdfUrl=https://ceur-ws.org/Vol-2994/paper51.pdf
|volume=Vol-2994
|authors=Luciano Caroprese,Giuseppe Manco,Marco Minici,Francesco Sergio Pisani,Ettore Ritacco
|dblpUrl=https://dblp.org/rec/conf/sebd/Caroprese0MPR21
}}
==Unbiasing Collaborative Filtering for Popularity-Aware Recommendation (Discussion Paper)==
Luciano Caroprese, Giuseppe Manco, Marco Minici, Francesco Sergio Pisani and Ettore Ritacco
ICAR-CNR, Via Bucci, 8-9c, Rende (Italy)
Abstract
We analyze the behavior of recommender systems relative to the popularity of the items to recommend.
Our findings show that most popular ranking-based recommenders are biased towards popular items, thus
affecting the quality of recommendation. Based on these observations, we propose a new deep learning
architecture with an improved learning strategy that significantly improves the performance of such
recommenders on low-popular items. The proposed technique is based on two main aspects: resampling
of negatives and ensembling of multiple instances of the algorithm. Experimental results on traditional
benchmark datasets show that the proposed approach substantially improves the recommendation
ability by balancing accurate contributions almost independently from the popularity of the items to
recommend.
Keywords
Recommender Systems, Collaborative Filtering, Deep Learning, Big Data
1. Introduction
The massive amount of information available in the form of digital catalogs and online
services allows users to access and consume online content, but at the same time it poses a choice
paradox. Purchasing items on e-commerce sites, selecting movies on streaming platforms or
connecting with other peers on social networks implies making choices among a huge amount
of elements. Recommender Systems play a crucial role within this context since they provide
users with a better experience through the recommendation of new content and data that users
are likely to appreciate.
Recommendations can have a disruptive impact: on the one hand, they assist users in their
choices; on the other hand, they can influence the choices themselves. Indeed, recommender
systems can favor particular categories of items or particular brands over others, thus introducing
a bias within the catalog [1]. A typical bias is intrinsic to the nature of recommendation. In
systems involving interaction between significant amounts of users and items, we observe the
presence of very few very popular items and many less popular others. This distribution follows
the so-called 80-20 rule that refers to the fact that 80% of users express preferences for only 20%
SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
luciano.caroprese@icar.cnr.it (L. Caroprese); giuseppe.manco@icar.cnr.it (G. Manco); marco.minici@icar.cnr.it (M. Minici); francescosergio.pisani@icar.cnr.it (F. S. Pisani); ettore.ritacco@icar.cnr.it (E. Ritacco)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
of the available items. In practice, user preferences follow a long tail distribution. Recommender
Systems based on collaborative filtering [2], which tend to characterize user preferences from
interactions, are affected by a relevant problem: they tend to suggest very popular items and
neglect less popular ones. This way, a reinforcement effect arises: the most popular items
become more and more popular than the less known ones. This phenomenon is called
popularity bias.
The quality of the recommendations is inevitably affected by the popularity bias, since most
recommendations are prone to contain objects that the user will consider trivial and well known.
By contrast, the capability of recommending items belonging to the long tail (i.e., less popular)
can disclose new perspectives: niche items, little-known objects, hidden gems that satisfy the
user's preferences can greatly improve user engagement, provided that the recommended items
are consistent with the user's taste.
The current literature has extensively studied the problems of fairness and bias [3, 4, 5] within
recommender systems, with particular emphasis on popularity bias [6, 7, 8]. Notably, in [9], it is
shown that unfair recommendations are concentrated on groups of users interested in long-tail
and less popular items. In this paper we focus our attention on recommender systems based on
ranking [10, 11], which are extremely flexible and, as a consequence, particularly interesting to
unbias. Typical approaches [12, 13] consider techniques to increase the representation of less
popular items, by either post-processing techniques or constraining the recommendation score.
Despite such recent efforts, unbiasing the recommendation from popular items is still an open
problem.
In this paper we devise a ranking-based recommender system for implicit feedback (RVAE),
based on a variational autoencoder architecture. The proposed model is a substantial extension of
the MVAE model proposed in [14] and in fact it inherits its accuracy and computational efficiency.
We analyze the behavior of RVAE and show that the model is characterized by the popularity
bias problem. We then propose an experimental study of specific techniques to overcome it.
The proposed technique is based on two main concepts: resampling/reweighting items and
ensembling of multiple instances of the algorithm. Our experiments show that these simple
strategies allow us to unbias the algorithm and hence provide more effective recommendations.
2. Basic Framework
We start by setting the notation that we shall use throughout the paper. In the context of collaborative filtering, $u \in U = \{1, \ldots, M\}$ indexes a user and $i, j \in I = \{1, \ldots, N\}$ index items for which the user can express a preference. We model implicit feedback, thus assuming a preference matrix $\mathbf{X} \in \{0,1\}^{M \times N}$, so that $x_{u,i} = 1$ whenever user $u$ expressed a preference for item $i$, and $x_{u,i} = 0$ otherwise. Also, $\mathbf{x}_u$ is the (binary) row indexed at $u$, representing all the item preferences for user $u$. Given $\mathbf{x}_u$, we define $I_u = \{i \in I \mid x_{u,i} = 1\}$ (with $n_u = |I_u|$). The preference matrix induces a natural ordering between items: $j \prec_u i$ means that $u$ prefers $i$ to $j$, i.e. $x_{u,i} > x_{u,j}$ in the rating matrix. Our objective is to devise a model for such an ordering.
Preference Modeling. We consider a general framework where preferences are modeled as
the effect of latent factors ultimately characterizing users and/or items. We shall consider two
basic instantiations of this general idea, and will provide a unified framework.
The first situation we consider is the Multinomial Variational Autoencoder (MVAE) framework proposed in [14]. Within this framework, for a given user $u$ the related $\mathbf{x}_u$ is modeled as the effect of a multinomial distribution governed by a prior $\mathbf{z}$, i.e.

$$\mathbf{x}_u \sim \mathrm{Discrete}(\pi(\mathbf{z})), \qquad \pi(\mathbf{z}) \propto \exp\{f_\theta(\mathbf{z})\}$$

Here, $f_\theta(\cdot)$ represents a neural network parameterized by $\theta$. The latent variable $\mathbf{z}$ is modeled by a prior $p(\mathbf{z})$ (typically a Gaussian distribution). Thus, the probability of preferences for a given user can be expressed as

$$p(\mathbf{x}_u) = \int p(\mathbf{x}_u|\mathbf{z})\, p(\mathbf{z})\, \mathrm{d}\mathbf{z}$$

Due to the intractability of the above integral, [15] devise a variational approach based on a proposal $q(\mathbf{z}|\mathbf{x}_u)$ that approximates the posterior distribution. Again, $q$ is modeled as a Gaussian distribution

$$q(\mathbf{z}|\mathbf{x}_u) = \mathcal{N}(\mathbf{z}; \mu_u, \sigma_u),$$

where $\mu_u, \sigma_u = g_\phi(\mathbf{x}_u)$ and $g_\phi$ is a neural network parameterized by $\phi$. By exploiting the inequality

$$\log p(\mathbf{x}_u) \geq \mathbb{E}_{\mathbf{z} \sim q(\cdot|\mathbf{x}_u)}\left[\log p(\mathbf{x}_u|\mathbf{z})\right] - \mathrm{KL}\left[q(\mathbf{z}|\mathbf{x}_u) \,\|\, p(\mathbf{z})\right]$$

we can finally learn the $\theta, \phi$ parameters by optimizing the loss

$$\mathcal{L}_{MVAE}(\theta,\phi) = -\sum_u \left\{ \mathbb{E}_{\mathbf{z}\sim q_\phi(\cdot|\mathbf{x}_u)}\left[\log p_\theta(\mathbf{x}_u|\mathbf{z})\right] - \mathrm{KL}\left[q_\phi(\mathbf{z}|\mathbf{x}_u) \,\|\, p(\mathbf{z})\right] \right\}$$

The overall framework is hence based on a regularized encoder-decoder scheme, where $q_\phi(\mathbf{z}|\mathbf{x}_u)$ represents the encoder, $p_\theta(\mathbf{x}_u|\mathbf{z})$ represents the decoder and the term $\mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}_u) \| p(\mathbf{z})]$ acts as a regularizer. In the training phase, for each $u$ a latent variable $\mathbf{z} \sim q_\phi(\cdot|\mathbf{x}_u)$ is sampled. Next, $\mathbf{z}$ is exploited to devise the probability $p_\theta(\mathbf{x}_u|\mathbf{z})$. Users with low probability are penalized within the loss and the network parameters can be updated accordingly.
Prediction for new items is accomplished by resorting to the learned functions $g_\phi$ and $f_\theta$: given a (partial) user history $\mathbf{x}_u$, we compute $\mathbf{z} = \mu_u$ and then devise the probabilities for the whole item set through $\pi(\mathbf{z})$. Unseen items can then be ranked according to their associated probabilities.
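For reference, when both the proposal and the prior are Gaussian (with standard normal prior), the KL regularizer above has a well-known closed form; a minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def gaussian_kl(mu, sigma):
    """Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)] -- the
    regularizer term of the variational loss above."""
    return 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - np.log(sigma ** 2))
```

The term vanishes exactly when the approximate posterior matches the prior, and grows as the encoder's output drifts away from it.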
The second formulation we consider is inspired by the Bayesian Personalized Ranking (BPR) model introduced in [10]. The idea underlying this model is that a preference $j \prec_u i$ can be directly explained as closeness in a latent space where both items and users can be mapped. Mathematically, this can be devised by computing a factorization rank $\mathbf{p}_u^T \mathbf{q}_i$ for each pair $(u, i)$, and modeling preferences by means of a Bernoulli process:

$$j \prec_u i \sim \mathrm{Bernoulli}(\tau), \qquad \tau = \sigma\left(\mathbf{p}_u^T(\mathbf{q}_i - \mathbf{q}_j)\right)$$

where $\sigma(x) = (1 + e^{-x})^{-1}$ is the logistic function. The optimal embeddings $\mathbf{P}$ and $\mathbf{Q}$ can hence be obtained by optimizing the loss

$$\mathcal{L}_{BPR}(\mathbf{P}, \mathbf{Q}) \propto -\sum_u \sum_{\substack{i,j \\ j \prec_u i}} \log \sigma\left(\mathbf{p}_u^T(\mathbf{q}_i - \mathbf{q}_j)\right)$$
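The BPR loss can be sketched directly from its formula; a minimal NumPy version (names and shapes are illustrative, not from the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(P, Q, triples):
    """Negative log-likelihood of observed preferences j <_u i.
    P: user embeddings (M x d); Q: item embeddings (N x d);
    triples: (u, i, j) tuples where user u prefers item i over item j."""
    return -sum(np.log(sigmoid(P[u] @ (Q[i] - Q[j]))) for u, i, j in triples)
```

The loss per triple is about log 2 for random embeddings and decreases as $\mathbf{p}_u$ aligns with $\mathbf{q}_i - \mathbf{q}_j$.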
We combine the two frameworks by adapting the BPR loss to the MVAE model. In particular, instead of modeling $p(\mathbf{x}_u|\mathbf{z})$, we directly model $p(j \prec_u i|\mathbf{z})$ within a similar variational framework. In short, the current preferences are encoded within a latent variable $\mathbf{z}$ that is further exploited to decode all ranks:

$$j \prec_u i \sim \mathrm{Bernoulli}(\tau_{i,j}), \qquad \tau_{i,j} = \sigma(r_i - r_j), \qquad \mathbf{r} = f_\theta(\mathbf{z})$$

Here, $\mathbf{r}$ represents the output of a neural network parameterized by $\theta$. For a given item $i$, the value $r_i$ represents the associated rank, which can be used to sort all preferences. The model can be obtained by optimizing the loss:

$$\mathcal{L}_{RVAE}(\theta,\phi) = -\sum_u \left\{ \mathbb{E}_{\mathbf{z}\sim q_\phi(\cdot|\mathbf{x}_u)}\Big[\sum_{\substack{i,j \\ j \prec_u i}} \log p_\theta(j \prec_u i|\mathbf{z})\Big] - \mathrm{KL}\left[q_\phi(\mathbf{z}|\mathbf{x}_u) \,\|\, p(\mathbf{z})\right] \right\} \qquad (1)$$

We call this model Ranking Variational Autoencoder (RVAE).
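Unlike BPR, the RVAE decoder head compares two entries of a single decoded rank vector rather than user-item dot products; a small sketch (the function name is ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_prob(r, i, j):
    """Decoder head of RVAE: p(j <_u i | z) computed from the
    decoded rank vector r = f_theta(z), one entry per item."""
    return sigmoid(r[i] - r[j])
```

Equal ranks give probability 0.5, and larger rank gaps push the probability towards 0 or 1; by construction the probabilities of the two orderings of a pair sum to 1.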
Learning by Negative Sampling. In the above formulation there are some details worth further discussion. When learning the RVAE model, optimizing the likelihood requires that all pairs of items are considered within Eq. (1). This is unrealistic with large item bases, and it is customary to only consider a subset $\mathcal{P}_u \subseteq \{(i,j) \mid i,j \in I;\, j \prec_u i\}$. The sampling of $\mathcal{P}_u$ is critical for determining the behavior of any predictive model; the most used approach in the literature is to uniformly sample, for each user $u$ and item $i$ (called positive item), a fixed number of items $\{j_1, \ldots, j_N\} \subseteq I \setminus I_u$, with the underlying assumption that $\forall k: j_k \prec_u i$. Thus, Eq. (1) can be rewritten as:

$$\mathcal{L}_{RVAE}(\theta,\phi) = -\sum_u \left\{ \mathbb{E}_{\mathbf{z}\sim q_\phi(\cdot|\mathbf{x}_u)}\Big[\sum_{(i,j) \in \mathcal{P}_u} \log p_\theta(j \prec_u i|\mathbf{z})\Big] - \mathrm{KL}\left[q_\phi(\mathbf{z}|\mathbf{x}_u) \,\|\, p(\mathbf{z})\right] \right\} \qquad (2)$$

This will be the basis loss upon which we develop our study next.
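The uniform sampling scheme that produces $\mathcal{P}_u$ can be sketched as follows (a simplified, non-batched version; names are ours):

```python
import random

def sample_pairs(positives, num_items, N=4, rng=random):
    """Build P_u: for each positive item i of a user, uniformly draw
    N negatives from the items the user never interacted with."""
    pos = set(positives)
    pool = [j for j in range(num_items) if j not in pos]
    return [(i, j) for i in positives for j in rng.sample(pool, N)]
```

Each positive item thus contributes exactly N pairs to the loss, regardless of its popularity.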
3. The Impact of Popularity
We start our analysis on the following popular benchmark datasets: i) Movielens, a time
series dataset containing user-item rating pairs along with the corresponding timestamps; ii)
Pinterest, based on the social media platform that allows users to save or pin an image (item) to their
board. The dataset denotes as 1 the pinned images for each user; iii) CiteUlike, a dataset
obtained from the homonymous service, which provides a digital catalog to save and share academic papers.

Figure 1: Item popularity distributions. (a) ML1M: 1007 low-popular items (avg pop: 6), 2367 mid-popular items (avg pop: 164), 142 top-popular items (avg pop: 1239). (b) Pinterest: 1767 low-popular items (avg pop: 18), 7620 mid-popular items (avg pop: 153), 529 top-popular items (avg pop: 568). (c) CiteUlike: 3783 low-popular items (avg pop: 4), 12693 mid-popular items (avg pop: 12), 504 top-popular items (avg pop: 61).

Within fig. 1 we plot all the items within the datasets, by increasing popularity.
For each dataset we identify three classes: low, mid and high popular items.
We then study the behavior of the RVAE model with respect to the popularity classes defined. To do so, we adopt the following protocol. For each dataset, 70% of the users are randomly sampled with all of their items. Each such user is associated with $\mathbf{x}_u$ and a set $\mathcal{P}_u$ of positive/negative item pairs. In particular, we consider all positive items within $\mathbf{x}_u$, and for each positive item $i$ we sample $N = 4$ negative items. The remaining 30% of the users are uniformly split into validation and test. In particular, for each user $u$ we consider a random subset $T_u \subseteq I_u$ representing 30% of the positive items, and $N_u$ represents a subset of 100 negative items sampled from $I \setminus I_u$. The vector $\mathbf{x}_u$ is masked to remove all elements in $T_u$. We then feed the masked $\mathbf{x}_u$ to obtain the score vector $\mathbf{r}_u$. Now, for a given cutoff value $k$, let us consider the $k - 1$ negative items for $u$ for which RVAE gives the highest score and, among them, the item $j$ having the minimum score. There is a hit with cutoff $k$, for the user $u$ and the item $i \in T_u$, if $r_{u,i} \geq r_{u,j}$. Let $H_u^k$ be the number of hits for the user $u$ with cutoff $k$. We define the Hit-Rate at $k$ on $U$ as

$$\mathrm{HR}@k = \frac{\sum_{u \in U} H_u^k}{\sum_{u \in U} |T_u|}$$

We can trivially specialize this definition for items within the low, medium and high popularity classes, by considering only items in $T_u$ that belong to the specific class. The results of the evaluation are summarized in table 1a. We can see that the model suffers mainly on low-popular items. As a matter of fact, the overexposure of popular items is predominant and the model learns to predict essentially those items. The fact is that popular items are easy to predict. However, it is
predict essentially those items. The fact is that popular items are easy to predict. However, it is
in the mid and especially on low popular items that the most interesting predictions can take
place: niche items are difficult to discover by an end user and hence their accurate suggestions
can greatly improve user engagement. The research question is hence: how can we boost the
model to improve the performance on low-popular items?
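The evaluation protocol above can be sketched as follows; a simplified version for cutoffs $k \geq 2$ (the data layout and names are our assumptions, not the paper's code):

```python
def hit_rate_at_k(scores, test_items, neg_items, k):
    """HR@k under the protocol above: a test positive i is a hit when
    its score is at least the score of the weakest among the k-1
    top-scored sampled negatives.
    scores: user -> score list; test_items/neg_items: user -> item ids."""
    hits = total = 0
    for u, r in scores.items():
        negs = sorted((r[j] for j in neg_items[u]), reverse=True)
        threshold = negs[k - 2]  # weakest of the k-1 top negatives
        hits += sum(r[i] >= threshold for i in test_items[u])
        total += len(test_items[u])
    return hits / total
```

Restricting `test_items[u]` to one popularity class yields the per-class HR@k used in the tables.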
4. Unbiased Recommendation
In a simple experiment, we retrain the model by only considering pairs $(i,j) \in \mathcal{P}_u$ such that $i$ is in the low popular class. We call this model RVAE$_L$. Compared to the results in table 1a, the results for this restricted model (shown in table 1b) show that if attention is placed on low popular items, their prediction accuracy can be improved. Similar results can be observed for the mid popular items. Thus, in order to unbias the model we need to rebalance the contribution of low (and mid) popular items with regard to the high popular ones. To achieve this goal, we study three different strategies.

Table 1: Predictive accuracy.

(a) RVAE

            |        HR@1             |        HR@5             |        HR@10
Dataset     | Global  Low   Med  High | Global  Low   Med  High | Global  Low   Med  High
Movielens   | 0.2510  0.00  0.16 0.43 | 0.5853  0.01  0.47 0.83 | 0.7443  0.05  0.65 0.94
Pinterest   | 0.2731  0.13  0.23 0.46 | 0.6992  0.46  0.66 0.86 | 0.8764  0.65  0.86 0.95
CiteULike   | 0.2875  0.07  0.22 0.67 | 0.6021  0.29  0.57 0.90 | 0.7638  0.48  0.75 0.95

(b) RVAE_L

            |   HR@1        |   HR@5        |   HR@10
Dataset     | Global  Low   | Global  Low   | Global  Low
Movielens   | 0.0012  0.15  | 0.0042  0.50  | 0.0063  0.74
Pinterest   | 0.0089  0.43  | 0.0166  0.81  | 0.0192  0.93
CiteULike   | 0.0254  0.32  | 0.0561  0.71  | 0.0703  0.89
The first strategy consists in weighting, for each pair $(i,j) \in \mathcal{P}_u$, the contribution to the loss with a factor inversely proportional to the popularity of the item $i$:

$$w_i = \frac{\gamma\left(1 + e^{\alpha(n_i - \beta - 1)}\right)^{-1} + 1}{\gamma + 1}$$

Here, $n_i$ represents the number of occurrences of $i$, and $\alpha, \beta, \gamma$ are the parameters representing the steepness, center and scale of the decay for high popular items. We experiment with $\alpha = 0.01$, $\beta$ representing the average frequency of mid-popular items and $\gamma = 100$, and call this variant RVAE$_W$. The above strategy has the advantage of reweighing the contributions of low and mid popular items with respect to high popular ones: the ratio between the weight of the most popular item and that of the least popular one is approximately $1/\gamma$. However, this weighting scheme has a main disadvantage. The weight is relative to a positive item $i$, but it is associated with pairs $(i,j) \in \mathcal{P}_u$. That is, besides weighting the contribution of $i$, this scheme also overexposes (or dually underexposes) the contribution of the negative item $j$. To avoid this, an alternative strategy consists in changing the sampling scheme that produces $\mathcal{P}_u$. In practice, rather than uniformly sampling, for each positive item, a fixed number $N$ of negative items, we apply an inversely stratified sampling where $m_i$ negatives are sampled, with $m_i$ being inversely proportional to the popularity of the item:

$$m_i = N \cdot \left\lceil \frac{\max(f)}{f_i} \right\rceil$$

where $f_i$ is the frequency of item $i$. The rationale behind the above formula is to provide the same visibility to each positive item in the loss function. Thus, the most popular item will be associated with exactly $N$ pairs. By contrast, low and mid popular items will be overexposed in the comparison. We call this variant RVAE$_S$.
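Both rebalancing rules can be sketched directly from the formulas above (the hyperparameter values used below are illustrative, not the tuned values of the paper):

```python
import math

def popularity_weight(n_i, alpha, beta, gamma):
    """w_i from the weighting formula: close to 1 for rare items and
    close to 1/(gamma+1) for very popular ones (for sharp decays)."""
    return (gamma / (1.0 + math.exp(alpha * (n_i - beta - 1))) + 1.0) / (gamma + 1.0)

def negatives_per_positive(f_i, f_max, N=4):
    """Inversely stratified sample size: the most popular item gets
    exactly N negatives, rarer items proportionally more."""
    return N * math.ceil(f_max / f_i)
```

The first function realizes the ~1/gamma weight ratio between the least and most popular items; the second equalizes the visibility of each positive item in the loss.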
The third strategy consists in combining the baseline RVAE model with RVAE$_L$. For a user $u$, let $\mathbf{s}_B$ be the score vector produced by RVAE, and $\mathbf{s}_L$ the score vector produced by RVAE$_L$. We define RVAE$_E$ as the model that produces the score $\mathbf{s}_E$ defined as

$$\mathbf{s}_E = \mathrm{Softmax}(\mathbf{s}_B) + \delta\, \mathbf{m} \odot \mathrm{Softmax}(\mathbf{s}_L)$$

The vector $\mathbf{m}$ masks all items but the low popular ones. The scores are normalized (via the Softmax function) to make the two models comparable. Finally, $\delta$ is a weight aimed at tuning the boost for the low popular items, as devised by RVAE$_L$. We experimentally found an optimal tuning with $\delta = 0.4$.
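A sketch of the ensemble scoring rule (function names are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def ensemble_scores(s_B, s_L, low_mask, delta=0.4):
    """RVAE_E: normalized base scores plus a delta-weighted boost
    from RVAE_L, restricted to low-popular items by the mask m."""
    return softmax(s_B) + delta * low_mask * softmax(s_L)
```

Only the entries selected by the mask are boosted; all other items keep exactly their base softmax score, which is what lets the ensemble help low-popular items without degrading the rest.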
Table 2: Comparative analysis.

                   |        HR@1             |        HR@5             |        HR@10
                   | Global  Low   Med  High | Global  Low   Med  High | Global  Low   Med  High
Movielens  RVAE    | 0.2510  0.00  0.16 0.43 | 0.5853  0.01  0.47 0.83 | 0.7443  0.05  0.65 0.94
           RVAE_W  | 0.0854  0.01  0.13 0.01 | 0.2958  0.07  0.40 0.10 | 0.4661  0.11  0.59 0.23
           RVAE_S  | 0.1245  0.06  0.12 0.13 | 0.3875  0.16  0.38 0.40 | 0.5825  0.28  0.57 0.61
           RVAE_E  | 0.2466  0.05  0.16 0.42 | 0.5506  0.22  0.43 0.80 | 0.6965  0.37  0.59 0.91
Pinterest  RVAE    | 0.2731  0.13  0.23 0.46 | 0.6992  0.46  0.66 0.86 | 0.8764  0.65  0.86 0.95
           RVAE_W  | 0.2196  0.29  0.23 0.18 | 0.6544  0.68  0.66 0.64 | 0.8592  0.80  0.86 0.87
           RVAE_S  | 0.2482  0.11  0.23 0.35 | 0.6770  0.45  0.65 0.80 | 0.8664  0.64  0.86 0.93
           RVAE_E  | 0.2726  0.26  0.22 0.46 | 0.6954  0.63  0.65 0.86 | 0.8620  0.80  0.84 0.94
CiteUlike  RVAE    | 0.2875  0.07  0.22 0.67 | 0.6021  0.29  0.57 0.90 | 0.7638  0.48  0.75 0.95
           RVAE_W  | 0.2711  0.09  0.25 0.46 | 0.6156  0.37  0.61 0.76 | 0.7698  0.55  0.77 0.87
           RVAE_S  | 0.3890  0.32  0.39 0.43 | 0.7264  0.61  0.73 0.76 | 0.8399  0.73  0.84 0.90
           RVAE_E  | 0.2805  0.16  0.21 0.66 | 0.5634  0.50  0.50 0.87 | 0.6941  0.71  0.64 0.92
Table 2 summarizes the results of the evaluation. We see that, in general, all the strategies
considerably improve the response of the model on the low popular items. However, RVAE$_W$ has
a low response on the high popular items. By contrast, RVAE$_S$ succeeds in boosting performance
on both the low popular and the mid popular items. Overall, the ensemble RVAE$_E$ provides the
best response, boosting low popular items without substantially degrading on the other classes.
5. Conclusions
The approach proposed in this paper is a preliminary study: we introduce a ranking-based collabo-
rative filtering algorithm (RVAE) and study how the algorithm is affected by popularity bias.
Next, we show how simple techniques based on reweighting/resampling and/or ensembling
can recalibrate the recommendations. There are several aspects that are worth further investi-
gation. First of all, both the weighting and the inverse stratified sampling schemes are based on
hyperparameters that need to be carefully tuned. Also, the ensemble strategy adopted here is simple;
more complex schemes that also take into account other model instantiations can be studied.
We reserve these challenges for future work.
References
[1] V. Tsintzou, E. Pitoura, P. Tsaparas, Bias disparity in recommendation systems, CoRR abs/1811.01461 (2018). URL: http://arxiv.org/abs/1811.01461.
[2] C. C. Aggarwal, Recommender Systems, Springer, 2016.
[3] B. Friedman, H. Nissenbaum, Bias in computer systems, ACM Trans. Inf. Syst. 14 (1996) 330–347.
[4] Z. Zhu, X. Hu, J. Caverlee, Fairness-aware tensor-based recommendation, in: ACM International Conference on Information and Knowledge Management, CIKM '18, 2018, pp. 1153–1162.
[5] Y. Deldjoo, V. W. Anelli, H. Zamani, A. Bellogin, T. Di Noia, A flexible framework for evaluating user and item fairness in recommender systems, User Modeling and User-Adapted Interaction (2021).
[6] O. Celma, P. Cano, From hits to niches?: Or how popular artists can bias music recommendation and discovery, in: 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition, 2008.
[7] H. Steck, Item popularity and recommendation accuracy, in: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, 2011, pp. 125–132.
[8] R. Borges, K. Stefanidis, On measuring popularity bias in collaborative filtering data, in: EDBT Workshop on BigVis 2020: Big Data Visual Exploration and Analytics, 2020.
[9] H. Abdollahpouri, M. Mansoury, R. Burke, B. Mobasher, The unfairness of popularity bias in recommendation, CoRR abs/1907.13286 (2019). URL: http://arxiv.org/abs/1907.13286.
[10] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Conf. on Uncertainty in Artificial Intelligence, UAI '09, 2009, pp. 452–461.
[11] T. Ebesu, B. Shen, Y. Fang, Collaborative memory network for recommendation systems, in: ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '18, 2018.
[12] Z. Zhu, J. Wang, J. Caverlee, Measuring and mitigating item under-recommendation bias in personalized ranking systems, in: ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, 2020, pp. 449–458.
[13] H. Abdollahpouri, R. Burke, B. Mobasher, Managing popularity bias in recommender systems with personalized re-ranking, CoRR abs/1901.07555 (2019). URL: http://arxiv.org/abs/1901.07555.
[14] D. Liang, R. G. Krishnan, M. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: ACM Conf. on World Wide Web, WWW '18, 2018, pp. 689–698.
[15] D. P. Kingma, M. Welling, Auto-encoding variational bayes, in: 2nd International Conference on Learning Representations, ICLR '14, 2014. arXiv:1312.6114v10.