 On Measuring Popularity Bias in Collaborative Filtering Data
Rodrigo Borges
rodrigo.borges@tuni.fi
Tampere University, Tampere, Finland

Kostas Stefanidis
konstantinos.stefanidis@tuni.fi
Tampere University, Tampere, Finland

ABSTRACT
The effect of having few data items responsible for the majority of ratings in a Collaborative Filtering recommendation, together with the complementary effect of having the majority of items responsible for few ratings given by the users, is usually referred to as popularity bias. The effect is known to reflect the preference of users for popular items, but also to be a consequence of the methods and metrics normally applied by these systems. Variational Autoencoders (VAE) are considered today the state-of-the-art for collaborative filtering recommenders, and can handle big and sparse data entries with robustness and high accuracy. A methodology is proposed here for characterizing the popularity bias in the Movielens and Netflix datasets, and when applying VAE for generating recommendations based on them. As a first step, the long tail model is applied for segmenting items and users into three different classes (Short Head, Medium Tail and Long Tail), depending on the proportion of interactions they are associated with. In addition, a real recommendation scenario is presented for measuring the proportion of unpopular items appearing among the suggestions provided by VAE. We consider characterizing the popularity in detail as a very first step for providing recommenders with the desired serendipity effect, and for expanding the knowledge of these systems about new and unpopular items with few ratings.

KEYWORDS
Recommendations; Bias; Collaborative filtering; Variational Autoencoders

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
A huge amount of data is produced, converted to digital format and published online each day, from scientific articles to potential romantic partners' information. Nevertheless, the amount of time users have available to spend browsing these platforms is severely limited compared to the size of most of these catalogs. This motivated the development of recommender systems (RS), proposed for presenting to users a subset of items to which they will most likely react positively.

The most popular approach for implementing RS is called collaborative filtering (CF), and relies on the premise that users who interacted with items similarly in the past (e.g., bought many common books) are similar to each other. Since they shared previous decisions, they are assumed to maintain their behavior and share future ones as well. CF solutions explore this assumption by suggesting to each user the items that his/her neighbors, i.e., users with similar behavior, consumed, but with which he/she has not yet had contact.

Let's assume a scenario in which a recommender operates through an algorithm trained according to an error-based metric (as most of them really do) [3]. By error-based metric, we mean that its success is measured by the number of right guesses it makes on a separate part of the data (the test set) after adjusting its weights. Let's assume that after a large number of rounds of recommendations, 10% of the available items were responsible for more than 30% of the user interactions registered by the platform. In its next training procedure, the algorithm will try to adjust its parameters to maximize its overall accuracy, which will certainly account more for those 10% of items than for unpopular ones responsible, for example, for 0.5% of the play counts. We imagine this happening successively, so that in each round the model is more adjusted to popular items, and unaware of a great slice of items that could potentially find their niches of consumption.

We regard the popularity bias effect as a mixture of unbalanced preferences authentically expressed by the users and a side effect of the algorithms and metrics applied in current systems. Apart from that, suggesting unpopular items has the desired effect of serendipity (providing users with novelty), and also expands the knowledge of the system about unpopular items with very sparse rating information [21].

We propose a methodology for identifying and retrieving the bias contained in popular collaborative filtering datasets. Interactions between users and items composing the Movielens and Netflix datasets were represented as static data for identifying popularity. They were ordered by the number of occurrences and segmented into Short Head, Medium Tail and Long Tail, first grouped by items and then by users.

The Variational Autoencoder (VAE) is a dimensionality reduction technique considered today as the state-of-the-art for the task of CF [15, 18]. It represents user interactions as normal distributions in a latent space, with great power for predicting unseen item ratings. We are here interested in tracking how prone to bias this solution is. We conduct standard procedures for training and testing it, and measure the proportion of each popularity category present in the results.

The rest of the paper is structured as follows. In Section 2, we present a detailed analysis of both datasets focusing on popularity bias. In Section 3, we provide a metric for retrieving how users are prone to each class of items, named Mainstreaminess. In Section 4, we simulate a real scenario of recommendations for checking how biased the state-of-the-art collaborative filtering approach is. We conclude and point out our future work in Section 6.

2 MEASURING POPULARITY BIAS
We treat the popularity bias effect here as a proxy for having few data items responsible for the majority of ratings in a dataset, and the complementary effect of having the majority of items responsible for very few ratings given by the users.

In order to demonstrate the effect of popularity bias in a real-world scenario, we selected two datasets widely used in the recommender systems research field, both describing movie consumption: the first one is provided by Movielens1, and the second one by Netflix [4]. A summary of the characteristics of the datasets (events2, users, items and sparsity) is presented in the first 5 columns of Table 1.

1 https://grouplens.org/datasets/movielens/
We proceed to demonstrate the bias effect in the datasets by applying the well-known long tail model [17], in which items are ordered according to the number of events associated with them. The main point here is to visualize how events are concentrated in a few popular items, entitled the short head (SH), and how the remaining events are spread over the rest of them, known as the long tail (LT). The long tail items can be further separated into two parts by singling out an intermediate proportion of them between the popular and the unpopular items, called the medium tail (MT). The events distribution is then segmented into three regions according to [1], who suggests 20% and 80% of the cumulative event count as the thresholds between SH and MT, and between MT and LT, respectively.

It is important to notice that the popularity bias can be addressed from two different perspectives, considering either items or users as responsible for the majority of samples. We conduct both analyses as a matter of comparison.
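The item-side segmentation just described can be sketched in a few lines of Python. The function below is an illustrative reading of the 20%/80% rule of [1], not the authors' code, and the `(user, item)` event format is an assumption:

```python
from collections import Counter

def long_tail_segments(events, sh=0.2, mt=0.8):
    """Split items into Short Head / Medium Tail / Long Tail by the
    cumulative share of events they account for (20%/80% thresholds).
    `events` is an iterable of (user, item) interaction pairs."""
    counts = Counter(item for _, item in events)
    total = sum(counts.values())
    segments = {"SH": [], "MT": [], "LT": []}
    cum = 0
    for item, c in counts.most_common():   # most popular first
        cum += c
        if cum <= sh * total:
            segments["SH"].append(item)
        elif cum <= mt * total:
            segments["MT"].append(item)
        else:
            segments["LT"].append(item)
    return segments
```

Running it over the (user id, movie id) pairs of a ratings file would reproduce the kind of split reported in Table 2.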

2.1 Item Bias
We start by applying the long tail model to each dataset, ordering items according to the number of events associated with them, as presented in Figure 1. It is possible to see the decaying effect when moving from popular items to unpopular ones (from left to right), and the three regions, delimited by vertical dashed lines, corresponding to SH, MT and LT. The x axes of the plots are kept linear while the y axes are converted to a logarithmic scale for the sake of visibility.
The unbalanced consumption effect becomes clear when analyzing the curves. The Netflix item distribution presents a wider MT region and a smoother decay than in the case of Movielens. Extremely unpopular items (the ones near the value 1 on the y axis) represent approximately 15% of the first dataset, and are not observed in the second one.

Figure 1: Lin-log plots for MovieLens (top) and Netflix (bottom) datasets interactions distribution grouped by items. The thresholds of 20% and 80% of the total summation are indicated by vertical dashed lines.

Fitting these distributions to a power law is useful for applying a general model [21]. The same data is also presented in log-log format in Figure 2. According to [9], a quantity x obeys a power law if it is drawn from a probability distribution p(x) ∝ x^(-α). Instead of a simple power law distribution, we have detected that the best overall fit occurs when considering its shifted version, p(x) ∝ (a + bx)^(-α). The Kolmogorov-Smirnov goodness of fit for Movielens items is 0.67 (α = 1.2), and for Netflix it is 0.38 (α = 1.7).

The general comparison of the two datasets is presented in Tables 1 and 2. The first relevant information to be mentioned is the number of interactions an item must receive in order to be placed in each class of the long tail model. From Columns 6 and 7 of Table 1, it is possible to notice that in order to be considered popular in the Movielens dataset, a movie should sum more than 23,301 play counts, and in order to belong to the LT it should have been watched less than 2,140 times. When it comes to Netflix items the situation changes: a popular movie now accounts for more than 101,061 interactions, and an unpopular one for less than 9,635.

The next interesting information to be highlighted is the actual sizes of the SH, MT and LT regions in the data distributions. Table 2 shows the general information about this segmentation, indicating the proportion of items belonging to each popularity class. When considering extremely unpopular items, the Movielens dataset is more prone to bias, having 92.3% of its movies beyond the 80% threshold of its online activity. When considering the popular ones, it also presents the smallest SH, with just 0.4% of the items responsible for 20% of all user activity. The Netflix data presents a considerably larger MT, i.e., 10.9% of all movies watched on the platform.

2 We are here considering an event as one single line in the dataset, containing information of user id, movie id and timestamp.

2.2 User Bias
We analyze the bias of our datasets considering the complementary perspective: instead of few items concentrating the majority of interactions in an online service, we now look at the effect of having few and very active users along with very sparse ones who rarely provide feedback to the system. The same methodology is replicated here, even though the effect now happens for different reasons.

A similar decaying effect, provoked by the different behavior of users, is observed, as one can see in Figure 3. This time the Movielens curve has a smoother decay than in the case of items, and the proportions associated with the three consumption categories seem more homogeneous than before.

None of the Movielens users watched nearly zero movies, and it is possible to observe sharp slopes in the SH and LT regions of the Netflix distribution curve. But the reasons that may have provoked these discontinuities go beyond the scope of this article.
                                                       Table 1: Datasets description.

             Dataset        #Events       #Users     #Items     Sparsity   20% Items     80% Items       20% Users   80% Users
            MovieLens      20,000,263     138,493     26,744     0.54%       23,301        2,140            775         100
             Netflix       100,325,382    477,412     17,768     1.18%      101,061        9,635            966         178
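The power-law fit reported above can be reproduced in spirit with the standard maximum-likelihood recipe of Clauset et al. for the plain (unshifted) power law; the shifted variant p(x) ∝ (a + bx)^(-α) used in the paper requires a numerical fit instead. A minimal numpy sketch, with illustrative names:

```python
import numpy as np

def fit_power_law(samples, xmin=1.0):
    """MLE exponent for a continuous power law p(x) ~ x^(-alpha), x >= xmin,
    plus the Kolmogorov-Smirnov distance between empirical and model CDFs."""
    x = np.sort(np.asarray([s for s in samples if s >= xmin], dtype=float))
    n = len(x)
    alpha = 1.0 + n / np.sum(np.log(x / xmin))        # Hill/MLE estimator
    cdf_model = 1.0 - (x / xmin) ** (1.0 - alpha)     # model CDF at the data points
    cdf_emp = np.arange(1, n + 1) / n                 # empirical CDF
    ks = np.max(np.abs(cdf_emp - cdf_model))
    return alpha, ks
```

Applied to the per-item event counts summarized in Table 1, this yields the kind of (α, KS) pairs quoted in Section 2.1.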




Figure 2: Log-log plots for MovieLens (top) and Netflix (bottom) datasets interactions grouped by items.

                 Table 2: Item Bias Summary

   Dataset       Short Head      Medium Tail        Long Tail
  MovieLens      118 (0.4%)      1,831 (6.8%)     24,677 (92.3%)
   Netflix       154 (0.8%)     1,938 (10.9%)     15,522 (87.4%)

The log-log graphs for the user-oriented analysis are presented in Figure 4. The fitting indexes to the theoretical distribution are 0.42 (α = 0.7) and 0.37 (α = 1.0) for Movielens and Netflix respectively, indicating stronger evidence that these data can be modeled by the shifted power law distribution.

The description of the user-based analysis is presented in Tables 1 and 3. In order to be considered a heavy user of Movielens, and consequently belong to the SH proportion of the user distribution, someone should have watched more than 775 movies; for the case of sparse users, less than 100 movies. A person who watched more than 966 movies on Netflix is considered a frequent user, and one who made less than 178 contributions to the data is placed in the long tail proportion of the distribution, located in the right area of Figure 3.

Figure 3: Lin-log plots for MovieLens (top) and Netflix (bottom) datasets interactions distribution grouped by users. The thresholds of 20% and 80% of the total summation are indicated by vertical dashed lines.

A general overview of how unbalanced the user interactions are is presented in Table 3. The SH and MT proportions are generally bigger when compared to the previous case and, as a direct consequence, fewer users are considered sparse in both cases.

                 Table 3: User Bias Summary

   Dataset       Short Head       Medium Tail        Long Tail
  MovieLens     3,261 (2.4%)     48,909 (35.3%)     83,062 (60%)
   Netflix      14,444 (3%)     149,905 (31.4%)    298,619 (62.5%)

3 MEASURING MAINSTREAMINESS
To conclude our analysis of the users of the datasets, we introduce a metric named mainstreaminess, for retrieving the information of how prone users are to each class of items. A similar approach is conducted in [2], where three different types of users are defined according to their interest in popular items. Here, we are
interested in giving a general overview of the dataset by taking an average of normalized profiles over the categories of items.

The main idea is to iterate through each user: (i) building a profile with how many items belong to each region of the item distribution (SH, MT and LT), (ii) normalizing the profile by the number of items consumed, and (iii) taking an average of these values for characterizing the dataset as a whole. For doing so, we adopt the Average Percentage Tail [1]:

    APT = (1 / |U_t|) Σ_{u ∈ U_t} |{i : i ∈ (L(u) ∩ Φ)}| / |L(u)|    (1)

where Φ corresponds to one of the three categories of items (short-head, medium-tail or long-tail), L(u) to the profile of user u, and U_t to the set of users. For ease of representation we define APT-SH, APT-MT and APT-LT as the proportion of each category, respectively.

The mainstreaminess measurements for the Movielens and Netflix datasets appear in Figure 5. These results indicate that Netflix users are more prone to popular items than Movielens users. The highest proportion of MT consumption is observed in the case of Movielens, together with the smallest proportion of LT.

Figure 4: Log-log plots for MovieLens (top) and Netflix (bottom) datasets interactions grouped by users.

Figure 5: Mainstreaminess, or how users are prone to each item class, for Movielens and Netflix datasets (APT-SH / APT-MT / APT-LT: Movielens 28 / 59 / 12; Netflix 30 / 53 / 16).

4 POPULARITY BIAS IN VARIATIONAL AUTOENCODERS
In order to verify the proposition in a recommendation situation, we count the proportion of long tail, medium tail and short head items present in each round of suggestions, in a regular operation of a state-of-the-art algorithm for Collaborative Filtering named Variational Autoencoders [15].

Variational Autoencoders (VAE) can be interpreted as a model whose aim is to find the probability distribution responsible for generating its input data. Let's suppose a set of input data x ∈ R^{d_x} following an unknown probability distribution p(x), and a set of latent variables defined in a low-dimensional space z ∈ R^{d_z} (d_z ≪ d_x). The final model can be summarized as p(x, z) = p(x|z)p(z), from which one could marginalize z and find p(x). But in most cases this integral cannot be found in closed form [14].

Variational Inference (VI) [13] was recently proposed to address this problem through optimization, assuming that the distribution can be approximated by a simpler one that still models the data. VI specifies Q as a family of densities whose members q(z|x) ∈ Q are candidates for the conditional p(z|x). The inference occurs by minimizing the Kullback-Leibler (KL) divergence between the approximated and the original density. After re-arranging terms, we have

    log p(x) − D_KL[q(z|x) || p(z|x)] = E_z[log p(x|z)] − D_KL[q(z|x) || p(z)]    (2)

We want to maximize log p(x) minus the approximation error, and as an alternative we can optimize the right-hand side. In order to do this, we rely on the parametric distributions q_φ and p_θ. The optimization process then corresponds to optimizing the parameters φ and θ of these distributions with:

    L_{θ,φ} = E_z[log p_θ(x|z)] − β · D_KL[q_φ(z|x) || p_θ(z)]    (3)

where L is the Evidence Lower Bound (ELBO), q_φ(z|x) corresponds to the estimation of the latent space z departing from the input data, named the Encoder, and p_θ(x|z) corresponds to estimating the original data departing from the latent space, named the Decoder (Figure 6). This defines the Variational Autoencoder. The first term addresses the reconstruction error, and the second term the error of the distribution approximation. The parameter β controls the strength of the regularization [15].

We start from the implementation published by the authors3 and add an item mapper to it so the model can refer each item to its category in the Long Tail model. The proportion of items belonging to each category in the top-k recommendation list is measured with Equation 1.

3 https://github.com/dawenl/vae_cf
Figure 6: Variational Autoencoders: the Encoder q_φ(z_u|x_u) maps a user's interactions x_u to the latent variable z_u, and the Decoder p_θ(x_u|z_u) reconstructs the interactions from it.

In the case of the Movielens data, users who rated less than 5 items, as well as ratings (from 0 to 5) lower than 3.5, were removed, ending up with 9,990,682 watching events from 136,677 users and 20,720 movies. 10,000 users were separated and split for validation and test (5,000 / 5,000). The VAE was trained with two hidden layers [20,720 -> 600 -> 200 -> 600 -> 20,720] for 200 epochs. The training batch size was set to 500 and the validation batch to 2,000. Weight initialization, activation functions, learning rate, and β regularization were inherited from [15].

The general results for the validation set achieved 0.33 for NDCG@10 and 0.34 for Recall@10 as the best results, with reasonably stable values after a hundred epochs.

The proportion of LT items increases in the first epochs of the training, but stabilizes at an irrelevant proportion of the recommendation results during the process (Figure 7). The proportion of SH items starts extremely high, before reaching an almost stationary state at around 60% of the items among the 10 highest scores provided by the recommender. A complementary effect is observed for MT items, which represent approximately 40% of the items after a few epochs.

In the case of Netflix, ratings lower than 3.5 and users who rated less than 5 items were also removed, ending up with 56,785,778 watching events from 461,285 users and 17,767 movies (sparsity: 0.693%). The same parameter values were replicated, except for the size of the input layer, which is smaller now [17,767 -> 600 -> 200 -> 600 -> 17,767]. 40,000 users were separated and divided equally for validation and test. The model was trained for 200 epochs.

The results achieved for the validation data were 0.32 for both the NDCG@10 and Recall@10 metrics, comparable to the best scenarios reported in the original paper, 0.39 and 0.35, but for NDCG@100 and Recall@20 respectively.

A similar behavior is observed in the case of the Netflix data, but at a slightly lower level than in the previous dataset. The proportion of extremely popular items now corresponds to approximately 55% of the suggestions. The proportion of LT items is likewise irrelevant, as one can notice in Figure 7.

Figure 7: APT-SH, APT-MT and APT-LT proportions in the validation set during the Variational Autoencoders for Collaborative Filtering training procedure for: (Top) Movielens and (Bottom) Netflix datasets.

5 RELATED WORK
Fairness: Typically, approaches for countering biases focus on how to strengthen fairness. In recommendations, such approaches can be distinguished as pre-processing, in-processing and post-processing. Pre-processing approaches target transforming the data so that any underlying bias or discrimination is removed. Such approaches work on modifying the input to the recommender, for example, by appropriate sampling (e.g., [7]), by adding more data to the input (e.g., [22]), or by performing database repair [19]. In-processing approaches target modifying existing or introducing new algorithms that result in fair recommendations, e.g., by removing bias. Existing approaches focus on fairness-aware matrix factorization [24], multi-armed bandits [11] and tensor factorization (e.g., [25]). When fairness with respect to both consumers and item providers is important, variants of the well-known sparse linear method (SLIM) can be used to negotiate the trade-off between fairness and accuracy and improve the balance of user and item neighborhoods [6]. Alternatively, we can augment the learning objective in matrix factorization by adding a smoothed variation of a fairness metric [24]. As another example, [5] presents a method that mitigates bias to increase fairness by incorporating randomness in variational autoencoder recommenders. Post-processing approaches treat the algorithms for producing recommendations as black boxes, without changing their inner workings. To ensure fairness, they modify the output of the algorithm (e.g., [12]). Moving from individuals to groups, significant research efforts have been made recently (e.g., [16, 20, 23]), targeting maximizing the satisfaction of each group member while minimizing the unfairness between them.

Popularity Bias: The popularity bias in recommendation results may be inherited from the data used to train the models, or even from the methods and metrics applied by them. [3] points to the limitations of the error-based evaluation metrics widely used in the field of recommender systems. They argue that methods trained to maximize the satisfaction of the majority of users will perform well on these metrics, but the problem lies in the fact that items with many training ratings will tend to have more positive test ratings, and will be liked by more users according to the test data.

It is important to differentiate the preferences expressed by users in historical data from the true preferences of a hypothetical scenario where all users would have rated all items, as pointed out in [21]. The author proposes a nearly unbiased accuracy mea-

ones working with Movielens and Netflix datasets, that popularity bias is present in most of the data available for experiments. This should help future studies when deciding thresholds for filtering items, when separating users in a test set, and also when considering the possibility of potentially popular items with few ratings.

As future work, we aim to explore in detail how to address popularity bias in the specific case of Variational Autoencoders applied in collaborative filtering. This is a powerful and scalable
surement for recommendation experiments, named Popularity-               solutions, for high performance recommendations, but also with
Stratified Recall, which favors items from the long tail with the        space for improvements.
aim of approximating observed and true preferences. The power-
law modeling is proposed as a surrogate of the unobserved rating         REFERENCES
information, in the context where recommendations from the                [1] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling
                                                                              Popularity Bias in Learning-to-Rank Recommendation. In RecSys.
long tail present small bias, but also increase variance and reduce       [2] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, and Bamshad
the accuracy. [1] addresses the popularity bias in matrix factor-             Mobasher. 2019. The Unfairness of Popularity Bias in Recommendation. In
ization solutions for recommendation by exploring the trade off               RMSE.
                                                                          [3] Alejandro BellogΓ­n, Pablo Castells, and IvΓ‘n Cantador. 2017. Statistical biases in
between long tail coverage and ranking performance. The regu-                 Information Retrieval metrics for recommender systems. Information Retrieval
larization factor is associated to the bias in results, to be adjusted        Journal 20, 6 (2017), 606–634.
in the experiments.                                                       [4] James Bennett, Stan Lanning, and Netflix Netflix. 2007. The Netflix Prize. In
                                                                              In KDD Cup and Workshop in conjunction with KDD.
   [10] tackles the specific situation of music recommendation            [5] Rodrigo Borges and Kostas Stefanidis. 2019. Enhancing Long Term Fairness
platforms considering the bias presented in datasets available for            in Recommendations with Variational Autoencoders. In MEDES.
                                                                          [6] Robin Burke. 2017. Multisided Fairness for Recommendation. CoRR
training ML models as a possible reason for a situation where a               abs/1707.00093 (2017).
group of artists are not suggested to users and therefore receive         [7] L. Elisa Celis, Amit Deshpande, Tarun Kathuria, and Nisheeth K. Vishnoi. 2016.
less compensation by streaming content providers. [8] has com-                How to be Fair and Diverse? CoRR abs/1610.07183 (2016).
                                                                          [8] Γ’. Celma and P. Cano. 2008. From hits to niches? or how popular artists can
pared Collaborative Filtering, Content-Based and Expert-Based                 bias music recommendation and discovery. In 2nd Workshop on Large-Scale
music recommendation engines for detecting popularity effect                  Recommender Systems and the Netflix Prize Competition (ACM KDD).
and the influence of the most popular artists in the network. They        [9] Aaron. Clauset, Cosma Rohilla. Shalizi, and M. E. J. Newman. 2009. Power-Law
                                                                              Distributions in Empirical Data. SIAM Rev. 51, 4 (2009), 661–703.
figured out that the collaborative algorithm is prone to popularity      [10] Andre Holzapfel, Bob L. Sturm, and Mark Coeckelbergh. 2018. Ethical Di-
bias, and that the two other approaches are more efficient when               mensions of Music Information Retrieval Technology. Transactions of the
                                                                              International Society for Music Information Retrieval 1, 1 (2018), 44 – 55.
exploring the long tail of the play count distributions.                 [11] Matthew Joseph, Michael J. Kearns, Jamie H. Morgenstern, and Aaron Roth.
                                                                              2016. Fairness in Learning: Classic and Contextual Bandits. In NIPS.
                                                                         [12] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. 2018.
                                                                              Recommendation Independence. In FAT.
6   CONCLUSION                                                           [13] Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes.
                                                                              In ICLR.
When discussing about popularity bias, the first question that           [14] Diederik P. Kingma and Max Welling. 2019. An Introduction to Variational
may come to someone is: if there are items more attractive than               Autoencoders. CoRR abs/1906.02691 (2019).
others, why promoting unpopular ones to a wider public? The              [15] Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018.
                                                                              Variational Autoencoders for Collaborative Filtering. In WWW.
first answer to this is commonly referred as the cold-start sit-         [16] Xiao Lin, Min Zhang, Yongfeng Zhang, Zhaoquan Gu, Yiqun Liu, and Shaoping
uation: when new items are introduced in the platforms, and                   Ma. 2017. Fairness-Aware Group Recommendation with Pareto-Efficiency. In
need to be incorporated in the algorithm. We are here talking                 RecSys.
                                                                         [17] Yoon-Joo Park and Alexander Tuzhilin. 2008. The Long Tail of Recommender
about potentially popular items with no historical data, that will            Systems and How to Leverage It. In RecSys.
need to enter the long tail before reaching the short head of the        [18] Noveen Sachdeva, Giuseppe Manco, Ettore Ritacco, and Vikram Pudi. 2019.
                                                                              Sequential Variational Autoencoders for Collaborative Filtering. In WSDM.
distribution. The challenge, in this case, is promoting relevant         [19] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Interventional
items among unpopular ones.                                                   Fairness: Causal Database Repair for Algorithmic Fairness. In SIGMOD.
   Mainstreaminess measurement have revealed users’ prefer-              [20] Dimitris Serbos, Shuyao Qi, Nikos Mamoulis, Evaggelia Pitoura, and Panayiotis
                                                                              Tsaparas. 2017. Fairness in Package-to-Group Recommendations. In WWW.
ences concentrated in MT items. In the case of Movielens it              [21] Harald Steck. 2011. Item Popularity and Recommendation Accuracy. In RecSys.
sums almost double the size of SH, and 6 times of the LT pro-            [22] Harald Steck. 2018. Calibrated recommendations. In RecSys.
portion. Even then, the recommenders suggests majority of SH             [23] Maria Stratigi, Jyrki Nummenmaa, Evaggelia Pitoura, and Kostas Stefanidis.
                                                                              2020. Fair Sequential Group Recommendations. In SAC.
items during the training phase. As discussed here before, the           [24] Sirui Yao and Bert Huang. 2017. Beyond Parity: Fairness Objectives for Col-
recommnender have probably learned with more information                      laborative Filtering. In NIPS.
                                                                         [25] Ziwei Zhu, Xia Hu, and James Caverlee. 2018. Fairness-Aware Tensor-Based
about popular items and this results in biased results.                       Recommendation. In CIKM.
   We consider the popularity bias effect as being associated to
the inner operation of the platforms, as much as to the social
effect in which people interact with popular items. Experiments
like the ones presented here are intended for measuring the
overall bias, without distinguishing both sources. We argue that
characterizing popularity bias in recommenders data and algo-
rithms is a first step for addressing it, and addressing also, as a
consequence, the problem of cold-start.
   Another objective of this study was to claim attention for the
researchers working on recommender systems field, specially the
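For clarity, the ranking metrics reported in the experiments, Recall@k and NDCG@k, can be sketched as below. This is an illustrative implementation under their standard binary-relevance definitions (with Recall@k normalized by min(k, number of held-out items), as in [15]), not the evaluation code actually used in the experiments; the example items and sets are hypothetical.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the held-out relevant items recovered in the top-k list,
    normalized by min(k, |relevant_items|) as in Liang et al. [15]."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / min(k, len(relevant_items))

def ndcg_at_k(ranked_items, relevant_items, k):
    """Normalized Discounted Cumulative Gain with binary relevance:
    each hit at rank i (0-based) contributes 1 / log2(i + 2)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:k])
              if item in relevant_items)
    # Ideal DCG: all relevant items placed at the top of the list.
    idcg = sum(1.0 / math.log2(i + 2)
               for i in range(min(k, len(relevant_items))))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: a top-5 list scored against 3 held-out items.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "d", "f"}
print(recall_at_k(ranked, relevant, 5))  # 2 of the 3 relevant items found
print(ndcg_at_k(ranked, relevant, 5))
```

A recommender that ranks all held-out items first reaches 1.0 on both metrics; popularity bias shows up when these averages are driven almost entirely by a small set of short-head items.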