Exploring Artist Gender Bias in Music Recommendation

Dougal Shakespeare1, Lorenzo Porcaro1, Emilia Gómez1,2, Carlos Castillo3
1 Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
2 Joint Research Centre, European Commission, Seville, Spain
3 Web Science and Social Computing Group, Universitat Pompeu Fabra, Barcelona, Spain
dougalian.shakespeare01@estudiant.upf.edu
{lorenzo.porcaro,emilia.gomez,carlos.castillo}@upf.edu

Proceedings of the ImpactRS Workshop at ACM RecSys '20, September 25, 2020, Virtual Event, Brazil. Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Music Recommender Systems (mRS) are designed to give personalised and meaningful recommendations of items (i.e. songs, playlists or artists) to a user base, thereby reflecting and further complementing individual users' specific music preferences. Whilst accuracy metrics have been widely applied to evaluate recommendations in mRS literature, evaluating a user's item utility from other impact-oriented perspectives, including their potential for discrimination, is still a novel evaluation practice in the music domain. In this work, we center our attention on a specific phenomenon for which we want to estimate if mRS may exacerbate its impact: gender bias. Our work presents an exploratory study, analyzing the extent to which commonly deployed state-of-the-art Collaborative Filtering (CF) algorithms may act to further increase or decrease artist gender bias. To assess group biases introduced by CF, we deploy a recently proposed metric of bias disparity on two listening event datasets: the LFM-1b dataset, and the earlier constructed Celma's dataset. Our work traces the causes of disparity to variations in input gender distributions and user-item preferences, highlighting the effect such configurations can have on users' gender bias after recommendation generation.

CCS CONCEPTS

• Social and professional topics → Socio-technical systems; Gender; • Information systems → Collaborative filtering; Recommender systems.

KEYWORDS

gender bias, bias disparity, music recommendation

1 INTRODUCTION

Impact-oriented Recommender System (RS) research is gaining attention as a novel paradigm for understanding not only how users interact with recommendations, but also for shedding light on how these interactions can influence users' behaviours in the short- and the long-term [25]. An outstanding issue when studying the possible impact of RS is the heterogeneity of evaluation procedures described in the literature. Evaluating recommender systems is a non-trivial task because of the multiple facets that a good recommendation can have, and the multiple players influencing these aspects [20]. Even if the need for going beyond evaluation in terms of accuracy metrics has been well-recognized by the RS community [32], shared practices for evaluating the impact of recommendations are still missing.

Notwithstanding, recent years have seen a rise in awareness in the scientific community about the implications of socio-technical systems' design and implementation responsible for reinforcing bias and discrimination [4, 42]. Music Information Retrieval (MIR) research is still in its early stage with regards to the analysis of the ethical dimensions and impact of music technology [19, 22, 37, 39], and several challenges still need to be tackled when approaching MIR research from a socio-technical perspective. A common issue is the availability of data, often limited in terms of size, user information or musical information, and, as in many other fields, a chronic shortage of gender-disaggregated data [35]. The difficulties in our research to retrieve the artists' gender are just one example of this limitation, as presented in Sections 3 and 4.

We center our attention on a specific phenomenon that recommender systems may exacerbate: gender bias. In its broader sense, gender discrimination is a disadvantage for a group of people based on their gender. Far from being an emerging problem, gender discrimination has its roots in cultural practices historically related with socio-political power differentials [12]. Nonetheless, the modern-day prevalence of gender discrimination is not to be understated: recent reports find the disproportionate treatment of female artists to be prevalent in the Western music industry to this day1. Whilst the cause of such treatment is multifaceted, our work traces the influence of one factor evidenced in the work of Millar [33], that is, the pre-existing gender bias of a music listener.

In this exploratory study, we assess the extent to which Collaborative Filtering (CF) algorithms commonly deployed in mRS may exacerbate pre-existing users' gender biases, thereby affecting an artist gender's exposure and proportional representation. We focus on the measurement of bias disparity in recommender systems, defined as "[...] the case where the recommender system introduces bias in the data, by amplifying existing biases and reinforcing stereotypes" [41]. Building on existing literature [29, 31, 41, 43], we first reproduce the study presented by Lin et al. [29], in which preference bias amplification in collaborative recommendation is analyzed using the MovieLens dataset [21], a dataset of user activity with a movie recommendation system. In our work, we focus on the music domain, making use of two publicly available Last.fm2 listening event datasets: 1) Celma's LFM-360k dataset [10]; 2) Schedl's LFM-1b dataset [38]. Our goal is twofold: on the one hand, reproducing and verifying whether previous results [29] hold across different datasets. On the other hand, we aim at highlighting which aspects specific to the music domain can be extracted by this analysis, connecting with existing literature on gender bias in music preferences [3, 33].

The paper is structured as follows. Section 2 provides an overview of previous works related to bias in Information Technology, focusing on gender bias, but also on how this bias has been approached in music-related fields. We then introduce the considered datasets, LFM-1b and LFM-360k, respectively in Sections 3 and 4. In Section 5, the recommendation models used and the experimental settings are presented, followed by Section 6, which details the results obtained. Lastly, in Section 7 conclusions and future work are discussed.

1 http://assets.uscannenberg.org/docs/aii-inclusion-recording-studio-2019.pdf
2 https://www.last.fm

2 RELATED WORK

The notion of bias has been extensively explored in the Information Retrieval domain [4, 5, 7, 11, 24]. Typically, metrics aim to capture relative bias (i.e. bias pre-existing in data, for example in user listening histories in LFM-1b) and algorithmic bias (i.e. how filtering algorithms can result in unfair item and user treatment) to measure disproportionate unfair treatment of a protected group.

One of the most well-studied biases in RS literature is popularity bias, with the music domain being no exception to this phenomenon [6, 10, 28]. This describes the scenario in which a few popular items are recommended frequently, while the majority of items in the long tail do not get proportional attention. Highlighted in the literature as a prominent issue for CF algorithms [1, 10, 34], Kowald et al. in [28] find that, from a user's perspective, the groups who do not favor popular items may receive worsened recommendations in terms of accuracy and calibration. Moreover, Ferraro et al. in [18] study the effect of musical styles with respect to popularity bias, showing that CF approaches increase users' exposure to popular musical styles.

Bias disparity is a metric deployed to assess bias propagation across user and item groups, measuring the deviation of the recommender output from the input preference, as detailed in Section 5.1. A first application to the RS domain was described by Tsintzou et al. [41], but the metric has recently gained more traction in its application to different domains. In Lin et al. [29], bias disparity is applied to measure the extent to which state-of-the-art CF algorithms can exacerbate pre-existing biases in the MovieLens dataset. Their findings show significant differences in bias propagation across memory- and model-based CF algorithms.

Gender treatment and issues of proportional treatment in RS have been considered in a range of literature, for which we highlight some examples. Ekstrand et al. [17] examined the gender distribution of item recommendations in the book RS domain. Results prove that commonly deployed CF models differ in the gender distributions of generated item recommendation lists, such that neighbour-based approaches are shown to proportionally reflect user-item preferences in their reading histories, whereas model-based matrix factorisation favors books whose author is of male gender. Furthermore, Ekstrand et al. in [16] study the effect of recommendation algorithms on the utility for users of different gender groups, finding differences in effectiveness across gender groups. Such work highlights that the effect in utility does not exclusively benefit large groups, implying that there may be other underlying latent factors that influence recommendation accuracy. To address such issues of disproportionate gender treatment in recommendations, Edizel et al. in [15] have recently proposed a novel means of mitigating the derivation of sensitive features (such as gender) in the latent space, using fairness constraints based on the predictability of such features. A similar approach proposing fairness-aware tensor-based recommendation is also presented by Zhu et al. in [44].

In the music domain, Aguiar et al. [2] propose a methodology to assess the extent to which artists ranked in Spotify playlists are affected by gender, after accounting for plausible determinants of inclusion on playlists such as country, song characteristics (e.g. bpm, key signature), and past streaming success. The authors find that there is some evidence consistent with the presence of bias (both for and against female artists); however, they do not draw subsequent relations between this and the disproportionately low streaming share of female artists on the platform. In the work by Anglada-Tort et al. [3], through the analysis of UK top 5 music charts between the years 1960-1995, the authors show how popular music is affected by a large gender inequality, showing the presence of an existing bias in the listening preferences towards male artists. Similarly, Millar in [33], surveying a population of Australian young adults, shows how music preferences are affected by gender bias, evidencing differences between male and female listeners. In contrast, in our work we apply an auditing strategy for bias propagation, showing under which conditions input preferences are reflected in RS output, inferring music preferences from the users' listening history grouped with respect to the artists' gender.

3 THE LFM-1B DATASET

The LFM-1b dataset consists of more than one billion listening events created by over 120,000 users of the music streaming platform Last.fm [38]. In our analysis, we consider user-artist playcounts formed by aggregating user-song listening events by common artists. We then scale the number of listens logarithmically, as done in [13, 26]. We work with a filtered version of the dataset in which: a) we remove users who listened to less than 10 unique artists, and artists listened to by less than 10 users; b) we discard users whose listening history contains more than 25% of artists with unknown gender, to mitigate the impact of artists with missing gender in the dataset.

User gender is represented in the dataset with three categories: male, female and N/A. We choose to focus only on users with self-declared gender, working with two final categories of user gender: male and female. As shown in Table 1, distributions are highly imbalanced towards men (72% of the users are men).

Artist gender is not represented in the LFM-1b dataset; consequently, we retrieve this information from the open music encyclopedia MusicBrainz3 (MB) [40]. Code repositories implementing the following approach are made openly available4, alongside the acquired results of the data wrangling5, to support reproducibility. We identify five discrete categories of gender defined in the MB database: male, female, other, N/A and undef. Artists of gender N/A and undef are differentiated as artists for which gender is not applicable and not identifiable, respectively. For bands, we compute gender counts of all members and then compute an overall classification based on whichever count has a majority. In the case of artists with gender ties (e.g. a band consisting of 2 males and 2 females), we discard such artists from our final analysis, as gender is in this instance deemed ambiguous. After applying this methodology, we are able to identify 27% of artists with a known gender. Distributions are observed to be highly imbalanced, such that artists of male gender constitute the majority (82%) of artists for which gender can be identified, as shown in Table 1.

In our final analysis, we further filter out artists not identified as male or female according to the procedure described above. Artists of gender other are discarded as we deem such data to be too sparse to be informative in the analysis of users' listening preferences. We note this group merits further future evaluation, perhaps relying on qualitative methods, and limitations of this binary approach are discussed in Section 7. Table 2 presents the top 5 artists based on the total sum of play counts in the filtered LFM-1b dataset. We observe a trend for male artists' popularity, having approximately twice as many play counts as top-rated female artists/bands. We also observe a trend for the top male artists on the platform to be more commonly composed of bands in comparison to the top-rated female artists.

3 https://musicbrainz.org/
4 https://github.com/dshakes90/LFM-1b-MusicBrainz-Gender-Wrangler
5 https://zenodo.org/record/3964506#.XyE5N0FKg5n

            LFM-1b               LFM-360k
            male      female     male      female
Users       31.4K     11.5K      94.3K     30.8K
%           71.67     28.33      75.40     24.60
Artists     127K      27.3K      50.4K     10.5K
%           82.30     17.70      82.83     17.17
Top-head    25.7K     4.8K       10.1K     1.5K
%           84.21     15.79      86.99     13.01
Long-tail   100K      22.2K      38.7K     8.5K
%           81.87     18.13      81.95     18.05

Table 1: Users' and artists' distributions after the filtering process. "Top-head" artists are the top 20% of artists by play counts, while the remaining 80% are the "long-tail."

LFM-1b
No.  Male artist    Plays   Female artist    Plays
1    Radiohead      2.6M    Lana Del Rey     1.2M
2    The Beatles    2.5M    Lady Gaga        1.1M
3    Pink Floyd     2.1M    Rihanna          0.8M
4    Daft Punk      2.0M    Björk            0.7M
5    Metallica      1.9M    Madonna          0.6M

LFM-360k
No.  Male artist    Plays   Female artist    Plays
1    Radiohead      6.2M    Björk            1.3M
2    The Beatles    5.4M    Avril Lavigne    1.1M
3    In Flames      4.9M    Madonna          1.1M
4    Metallica      4.3M    Britney Spears   0.9M
5    Muse           4.2M    Regina Spektor   0.9M

Table 2: Top 5 artists ordered by total play counts in the LFM-1b and LFM-360k datasets.

4 THE LFM-360K DATASET

The LFM-360k dataset [10] consists of approximately 360,000 users' listening histories from Last.fm collected during Fall 2008, presenting a snapshot of listening activity for an earlier period in comparison to the LFM-1b dataset. With respect to user gender distributions, the proportion of users with a self-declared gender rises to 91%, whereas, similarly to the LFM-1b dataset, artist gender is not defined. To resolve this, we implement the same pre-processing methodology with the MB database as described for the LFM-1b dataset. After further applying the filtering criteria previously detailed, we are able to identify 31% of artists with a known gender, a proportion notably higher than what we were able to identify for the LFM-1b dataset. As presented in Table 1, artist gender distributions in the filtered dataset are once again highly imbalanced towards artists classified as men. For users with identified gender, we again observe a high imbalance towards male users (75%), comparable to rates observed in the LFM-1b dataset.

When comparing the two datasets we observe several additional differences and similarities which may impact the propagation of a gender bias in artist recommendations. First, the number of users is significantly larger than that of the LFM-1b, whilst the number of artists is much smaller. Second, sparsity is higher in the LFM-360k dataset in comparison to the LFM-1b. Third, with regard to the top 5 artists of male and female gender in the dataset, we observe significantly higher playcounts for artists classified as male in comparison to the LFM-1b dataset, as shown in Table 2. With regard to similarities across the two datasets, we observe that the top 5 popular male artists are more commonly bands in comparison to the top 5 female artists. In addition, we observe that the long tail of both datasets contains a significantly higher proportion of female artists in comparison to the top head, reinforcing the conclusion that female artists are significantly more likely to be less popular on the Last.fm platform and hence more likely to be less recommended as a result of this popularity bias.

5 METHODOLOGY

5.1 Evaluation Metrics

In this section, we formally outline the metrics of preference ratio and bias disparity, as well as the accuracy and beyond-accuracy metrics considered during the evaluation.

Preference ratio (PR). Let U be the set of n users, I be the set of m items and S be the n×m input matrix, where S(u, i) = 1 if user u has selected item i, and zero otherwise. Given matrix S, the input preference ratio for user group G on item category C is the fraction of liked items by group G in category C, formally defined as the following:

    PR_S(G, C) = ( Σ_{u∈G} Σ_{i∈C} S(u, i) ) / ( Σ_{u∈G} Σ_{i∈I} S(u, i) )    (1)

Bias disparity (BD). It is defined as the relative difference between the preference bias for input S and the output of a recommendation algorithm R. Formally, we define the metric as the following:

    BD(G, C) = ( PR_R(G, C) − PR_S(G, C) ) / PR_S(G, C)    (2)

In our analysis, we generate a set of r ranked items, R_u, which have the highest predicted ratings for a given user u, limiting the value of r to 5.

Accuracy and beyond-accuracy metrics. To evaluate the RS performance, we additionally deploy two accuracy metrics, Precision and nDCG, and three beyond-accuracy metrics: coverage, spread and long-tail percentage. We refer to the metrics formulation as detailed in the work by Di Noia et al. [14]. Precision (p@n) captures the proportion of relevant items in top-n recommendations, such that relevance is a binary function that represents the relevance of item i for a user u. In our work, we consider relevant a recommendation which is greater than or equal to the average scaled listening count for a user, after discarding outliers in the data computed using the interquartile range. Although p@n is useful for analysing generated item recommendations, it does not capture accuracy aspects relating to the rank of a recommendation. Hence, in our work we also deploy the metric nDCG, a rank-sensitive metric used to evaluate the accuracy of a RS. With respect to metrics beyond accuracy, we utilise both spread and coverage to capture a recommender system's ability to recommend a broad range of unique items. Such approaches are important to consider in our work to potentially reason about and explain bias propagation across artist genders. The metric long-tail percentage is used to capture the proportion of item recommendations which exist in the long tail. In our work, we define the long tail as the 80% of least popular items in the system. We use the metric to capture a filtering algorithm's capacity to display the popularity bias.

5.2 Recommendation Algorithms

We test several commonly deployed memory- and model-based CF algorithms, following a similar approach to previous work [28, 29]. Using Surprise [23], a Python library for recommender systems, we formulate our music recommendations as a rating prediction problem where we predict the preference of a target user u for a target artist a. We then evaluate RS recommending the top-5 artists with the highest predicted preferences.

We consider two types of CF algorithms: (1) a KNN-based approach, UserKNNAvg [27], and (2) a factorisation-based approach, Non-Negative Matrix Factorization (NMF) [30]. Hyperparameters of UserKNNAvg and NMF are tuned to give the best performance we can achieve with respect to the rank-aware metric nDCG. In addition, we consider the MostPopular and UserItemAvg algorithms, which respectively recommend the most popular and the highest rated artists. We consider these algorithms for a baseline comparison.

A variation of the leave-l-out evaluation detailed in [9] is performed, whereby we translate the approach to evaluate a top-n RS. Drawing influence from the methodology of Said et al. [36], we define 3 parameters: (1) n, the size of the recommendation list generated; (2) N, the number of items selected for each user to appear in the test set, where N is constrained to be > n to allow for variance in item recommendations across tested algorithms; (3) M, the minimum number of unique artists listened to by a user, where M is constrained to be > N to ensure a non-empty test set is able to be formed for each user. We construct three folds, randomly selecting for each user N items in their listening history to belong to the fold's test set and then subsequently removing these listening events from the fold's training set. For each of the algorithms tested, we compute all the evaluation metrics and preference ratios over each fold and then subsequently report average performance. In our work we set N = 10, M = 20 and n = 5, thereby generating top-5 recommendation lists. We consider a user's test set of size N as the sample space for recommendations to be formed.

5.3 Experimental Design

We set up two experimental designs to evaluate variations in gender bias disparity across recommended artists and user groups for the two datasets. For all experiments detailed, code repositories are made openly available6. Experiment 1 is a real-world scenario in which male and female gender distributions are representative of those in both datasets. Experiment 2 is an extreme scenario in which all users have high levels of preference ratio, representing extreme listening preferences towards artists of a specific gender.

Experiment 1. We generate recommendations for a sample of all users for which gender can be identified. In the LFM-1b dataset, we limit the size of this sample to 30% of all male and female users in the whole dataset, randomly chosen (approx. 12,000 users), due to computational constraints. The size of the user sample for the LFM-360k dataset was also constrained to be approximately the same size as the sample for the LFM-1b dataset. User and artist gender distributions in both samples are representative of overall gender distributions in the entirety of both datasets. We therefore use this experiment to consider the case of gender bias propagation under a real-world scenario, assessing the extent to which gender bias disparity may differ across datasets.

Experiment 2. We generate recommendations only for a sample of male and female users which have high preference ratios in the dataset, thereby simulating an extreme scenario under which all users are highly biased towards one artist gender group in their listening preferences. For the LFM-1b dataset, we select the top 30% of both male and female user groups with the highest maximum input preference ratios, maintaining both the proportions of male and female users in the datasets and the sample size of experiment 1. For the LFM-360k dataset, we sample users from both male and female user groups maintaining the distribution of male and female users in the original dataset. The final user sample has approximately the same sample size as that of the LFM-1b user sample.

Figure 1 represents the distributions of users' input preference ratio towards male and female artist groups. For both datasets considered in this study, it shows that only around 20% of users have a preference ratio towards male artists lower than 0.8. On the contrary, 80% of users have a preference ratio lower than 0.2 towards female artists. Due to the disproportionate amount of users with extreme preferences for male artists across both datasets, the random sampling methodology proposed does little to assess extreme preference towards female artists, resulting in a situation very similar to experiment 1. To resolve this, we further limit our sample space to only users who have extreme preference for female artists, with input preference ratio towards female artists > 0.6. This results in a sample size reduction to 100 users for the LFM-1b dataset, and 400 users for the LFM-360k dataset. Although reduced in size in comparison to experiment 1, we believe such experimental designs to be fundamental to measure the extent to which the treatment of users with extreme preferences differs across artist genders. Experiment 2 represents a situation opposite to the one proposed in experiment 1, thanks to which we can assess whether bias propagation is not embedded in the gender per se, but is a result of pre-existing bias.

6 https://github.com/dshakes90/Last-fm-Gender-Bias-Analysis

Figure 1: Input Preference Ratio (PR) distributions: LFM-1b (top) and LFM-360k (bottom).

6 RESULTS

6.1 Experiment 1 - Whole population

We report in Figure 2 preference ratio, and in Figure 3 bias disparity, results obtained with the LFM-1b dataset. Figure 4 and Figure 5 present preference ratio and bias disparity results respectively for the LFM-360k dataset. The dotted lines in Figure 2 and Figure 4 represent input preference ratios, whereas the plots' bars display output preference ratios computed from generated recommendation lists. With regard to pre-existing bias, users in both datasets display high and low input preference ratios for male and female artists respectively, thereby in line with the findings of Millar [33]. In addition, for both artist genders, input preference ratios can be seen to be higher for users who share the same gender as the artist.

             MostPopular  UserItemAvg  UserKNNAvg   NMF
precision    0.010        0.595        0.676       *0.734
nDCG         0.012        0.663        0.793       *0.880
coverage     1.7E-04      0.364       *0.558        0.552
spread       2.322        11.85       *12.84        12.72
longtail %   0            0.027        0.053       *0.054

Table 3: Experiment 1 evaluation results on the LFM-1b dataset. Values in bold represent the top value, while those marked with * are results where the difference is statistically significant, according to a t-test with α = 0.05.

With regard to bias propagation after recommendation, all recommendation models tested result in a positive bias disparity for male artists, for which there is minimal variance in treatment across user genders. The popularity-based algorithm results in the highest levels of bias disparity for both male and female users, whilst the NMF and UserKNNAvg algorithms tested result in the lowest absolute levels of bias disparity, with marginal difference in bias propagation across the two algorithms. What is more, our findings show male users to be more affected by bias propagation in the LFM-1b dataset, whilst for LFM-360k we observe bias propagation to be greater for female users, thereby in line with the findings of Lin et al. [29]. With regard to bias disparity for female artists, negative levels are observed for all algorithms tested. The MostPopular algorithm results in the lowest levels of bias disparity due to female artists having significantly lower popularity in both datasets tested, as shown in Table 1. We observe bias propagation to be greater for recommendations generated using the LFM-1b dataset, reflected in the lower long-tail percentage attained. This suggests that users in the LFM-1b dataset may be more subject to a popularity bias in comparison to LFM-360k, which may translate to increased levels of gender bias disparity due to female artists proportionally residing less in the top head. Together, our findings suggest that differences in bias propagation across the two datasets may be traced to pre-existing bias entering the system in the form of listening events.

6.2 Experiment 2 - Extreme preferences

Considering users with extreme preferences for female artists, we observe the inverse scenario of experiment 1, such that bias disparity is positive for female artists and negative towards male artists, as shown in Figure 3 and Figure 5. For both datasets, we comment that one cause of such disparity is a dramatic imbalance in users' listening preference, which then subsequently propagates through to other users' recommendations. Our findings show that such bias propagation is not reserved for male artists on the platform and can, under extreme scenarios, emerge in the opposite manner. For both memory- and model-based approaches tested we observe significant differences in bias disparity: NMF results in the smallest absolute bias disparity increase, thereby reflecting a user's input preference, whereas the neighbour-based UserKNNAvg increases absolute bias disparity levels towards whichever user-artist preference is in the majority. The tendency of NMF to propagate less bias, positively or negatively speaking, in comparison to the other models is also reflected in the results obtained from the beyond-accuracy metrics evaluation. Indeed, for experiment 2 NMF achieves the highest levels of coverage, recommending wider subsets of artists, and at the same time high levels of recommendation spread. Together these results suggest that the model-based algorithm considered in this study is capable of achieving a higher level of diversification in the outcomes in comparison to the memory-based model. Translated to our scenario, it means that NMF is the algorithm that focuses less on recommending a specific gender group, avoiding the exacerbation of pre-existing bias in the dataset that other recommendation algorithms exhibit. Again, the effect of bias propagation is seen to be more amplified in the case of the LFM-1b dataset.

Figure 2: Preference Ratio (PR) results for the LFM-1b dataset for experiment 1 (left column) and experiment 2 (right column).

Figure 3: Bias Disparity (BD) results for the LFM-1b dataset for experiment 1 (left column) and experiment 2 (right column).

Figure 4: Preference Ratio (PR) results for the LFM-360k dataset for experiment 1 (left column) and experiment 2 (right column).

Figure 5: Bias Disparity (BD) results for the LFM-360k dataset for experiment 1 (left column) and experiment 2 (right column).

7 CONCLUSIONS AND FUTURE WORK

Studies of gender bias in music preferences, conducted in fields such as Music Psychology and Gender Studies, have already evidenced how socio-cultural factors are responsible for the disparate treatment of not-male artists. In the field of MIR, relatively little research has analyzed how existing technology can have a role in mitigating or amplifying this bias. In line with the studies on bias disparity in the RS literature, focusing on the musical domain we show how recommendation outcomes can actually impact gender bias in music preferences. Using a binary gender classification, where users and artists are classified as male or female, we have shown how at different levels recommender systems can propagate a pre-existing bias. In addition, simulating an "upside down" world where users have a much higher preference towards female artists, we still find evidence of an exacerbation of that bias. Our results show that gender bias can be propagated by CF-based recommendations, according to the bias present in the data. Hence, RS can have a role in propagating bias, but, at least in our exploratory study, we have not found evidence that they cause the emergence of new forms of biases.

The limitations of our work are several. First, it is important to remark that the binary classification of gender is an oversimplification of gender representation. The state-of-the-art perspective on gender from both natural and social science domains is often non-binary, where male and female are just two of the many genders with which an individual may choose to identify. Binary definitions of gender have been widely critiqued as socially constructed through routine gendered performances [8, 12]; thereby, considering gender to be only binary in this work is both limiting and, to some degree, reinforcing of such binary logic. Second, the evaluation of RS is computed such that the impact of the outcome can be understood in the short- but not in the long-term. Using longitudinal data or simulation frameworks, we believe that a better comprehension of the phenomenon can be achieved, complementing the results we have presented. Lastly, Last.fm users tend to come mostly from

REFERENCES
[1] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, and Bamshad Mobasher. 2019. The unfairness of popularity bias in recommendation. CEUR Workshop Proceedings 2440 (2019). arXiv:1907.13286
[2] Luis Aguiar, Joel Waldfogel, and Sarah Waldfogel. 2018. Playlisting Favorites: Is Spotify Gender-Biased? Technical Report, November. https://ec.europa.eu/jrc/sites/jrcsh/files/jrc113503.pdf
[3] Manuel Anglada-Tort, Amanda E. Krause, and Adrian C. North. 2019. Popular music lyrics and musicians' gender over time: A computational approach. Psychology of Music (2019). https://doi.org/10.1177/0305735619871602
[4] Ricardo Baeza-Yates. 2018. Bias on the web. Commun. ACM 61, 6 (2018), 54–61. https://doi.org/10.1145/3209581
[5] Solon Barocas and Andrew D. Selbst. 2014. Big Data's Disparate Impact. California Law Review 671 (2014), 671–732.
[6] Christine Bauer and Markus Schedl. 2019. Global and country-specific mainstreaminess measures: Definitions, analysis, and usage for improving personalized music recommendation systems. PLOS ONE (2019), 1–36.
[7] Engin Bozdag. 2013. Bias in algorithmic filtering and personalization. Ethics and Information Technology 15, 3 (2013), 209–227. https://doi.org/10.1007/s10676-013-9321-6
[8] Judith Butler. 2006. Gender Trouble. Taylor and Francis.
[9] Rocío Cañamares, Pablo Castells, and Alistair Moffat. 2020. Offline evaluation options for recommender systems. Information Retrieval Journal 23 (2020). https://doi.org/10.1007/s10791-020-09371-3
[10] Òscar Celma. 2010. Music Recommendation and Discovery: The Long Tail, Long Fail, and Long Play in the Digital Music Space. Springer-Verlag Berlin Heidelberg.
[11] Henriette Cramer, Jean Garcia-Gathright, Aaron Springer, and Sravana Reddy. 2018. Assessing and addressing algorithmic bias in practice. Interactions 25, 6 (2018), 58–63. https://doi.org/10.1145/3278156
[12] Simone de Beauvoir. 1949. The Second Sex. Vintage Classics.
[13] Sarah Dean, Sarah Rich, and Benjamin Recht. 2020. Recommendations and User Agency: The Reachability of Collaboratively-Filtered Information. In Proceedings of the 3rd ACM Conference on Fairness, Accountability and Transparency (ACM FAccT 2020). Barcelona, Spain, 436–445. https://doi.org/10.1145/3351095.3372866
[14] Tommaso Di Noia, Jessica Rosati, Paolo Tomeo, and Eugenio Di Sciascio. 2017. Adaptive multi-attribute diversity for recommender systems. Information Sciences 382-383 (2017), 234–253. https://doi.org/10.1016/j.ins.2016.11.015
[15] Bora Edizel, Francesco Bonchi, Sara Hajian, André Panisson, and Tamir Tassa. 2019. FaiRecSys: mitigating algorithmic bias in recommender systems. International Journal of Data Science and Analytics 9, 2 (2019), 197–213. https://doi.org/10.1007/s41060-019-00181-5
[16] Michael D. Ekstrand, Mucun Tian, Jennifer D. Ekstrand, Oghenemaro Anuyah, David Mcneill, and Maria Soledad Pera. 2018. All The Cool Kids, How Do They Fit In? Popularity and Demographic Biases in Recommender Evaluation and Effectiveness. In Proceedings of the 1st ACM Conference on Fairness, Accountability and Transparency (ACM FAccT 2018), Vol. 81. 172–186. https://doi.org/10.18122/B2GM6F
[17] Michael D. Ekstrand, Mucun Tian, Mohammed R. Imran Kazi, Hoda Mehrpouyan, and Daniel Kluver. 2018. Exploring Author Gender in Book Rating and Recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). 242–250. http://dl.acm.org/citation.cfm?doid=3240323.3240373
[18] Andres Ferraro, Dmitry Bogdanov, Xavier Serra, and Jason Yoon. 2019. Artist and style exposure bias in collaborative filtering based music recommendations. In 1st Workshop on Designing Human-Centric MIR Systems (wsHCMIR19), co-located at the 20th Conference of the International Society for Music Information Retrieval (ISMIR 2019). arXiv:1911.04827 http://arxiv.org/abs/1911.04827
[19] Emilia Gomez, Andre Holzapfel, Marius Miron, and Bob L. Sturm. 2019. Fairness, Accountability and Transparency in Music Information Research (FAT-MIR). https://doi.org/10.5281/zenodo.3546227
[20] Asela Gunawardana and Guy Shani. 2015. Evaluating Recommender Systems. Springer US, Boston, MA, 265–308. https://doi.org/10.1007/978-1-4899-7637-6_8
[21] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems 5, 4 (2015), 1–19.
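The band-gender rule described in Section 3 (a majority vote over members' genders, discarding ties) can be sketched as follows. This is our own illustration of the stated rule, not the authors' released code; the function and variable names are hypothetical.

```python
from collections import Counter
from typing import List, Optional

def classify_band_gender(member_genders: List[str]) -> Optional[str]:
    """Majority vote over band members' genders ("male"/"female").

    Returns the majority gender, or None when no member has a usable
    gender or the counts tie (e.g. 2 males and 2 females), in which
    case the artist is deemed ambiguous and discarded.
    """
    counts = Counter(g for g in member_genders if g in ("male", "female"))
    if not counts:
        return None
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie: ambiguous, discard from the analysis
    return top[0][0]
```

Under this sketch, a trio with two male members and one female member is classified as male, while an evenly split quartet is dropped.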
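Equations (1) and (2) can be computed directly from a binary interaction structure. The following is a minimal sketch in plain Python (our own helper names, with S represented as a user-to-items mapping rather than a matrix):

```python
def preference_ratio(S, group, category):
    """PR_S(G, C): the fraction of group G's selections falling in category C.

    S: dict mapping each user to the set of items they selected,
    group: iterable of users (G), category: set of items (C).
    """
    in_category = sum(len(S[u] & category) for u in group)
    total = sum(len(S[u]) for u in group)
    return in_category / total if total else 0.0

def bias_disparity(pr_input, pr_output):
    """BD(G, C): relative change of the preference ratio after recommendation."""
    return (pr_output - pr_input) / pr_input

# Toy example: two users, items a-d, where {a, b} are (say) male artists.
S = {"u1": {"a", "b", "c"}, "u2": {"a", "d"}}
male_artists = {"a", "b"}
pr_in = preference_ratio(S, ["u1", "u2"], male_artists)  # 3 of 5 selections
```

If the recommendations R then yield an output preference ratio of 0.8 for the same group and category, BD = (0.8 − 0.6) / 0.6 ≈ 0.33, i.e. the recommender amplified the input bias by a third.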
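The per-user leave-N-out fold construction with the n, N and M constraints described in Section 5.2 can be sketched as below. This is a hedged reconstruction of the stated procedure, not the authors' code; names are our own.

```python
import random

def make_folds(history, n_folds=3, N=10, M=20, seed=0):
    """Per-user leave-N-out splits (sketch of the paper's protocol).

    history: dict mapping user -> list of unique artist ids listened to.
    Users with fewer than M unique artists are excluded; in each fold,
    N randomly chosen artists form the user's test set and the remaining
    listening events form the training set (M > N keeps it non-empty).
    """
    rng = random.Random(seed)
    folds = []
    for _ in range(n_folds):
        fold = {}
        for user, artists in history.items():
            if len(artists) < M:
                continue  # below the minimum-profile threshold M
            test = set(rng.sample(artists, N))
            train = [a for a in artists if a not in test]
            fold[user] = {"train": train, "test": test}
        folds.append(fold)
    return folds
```

With N = 10, M = 20 and n = 5 as in the paper, each retained user contributes a 10-item test set per fold, from which top-5 lists are evaluated.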
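The long-tail percentage metric, with the long tail defined as the 80% least popular items, can be sketched as follows (our own illustration, assuming total play counts determine popularity):

```python
def long_tail_percentage(recommendations, playcounts, tail_fraction=0.8):
    """Share of recommended items that belong to the long tail.

    playcounts: dict mapping item -> total play count;
    recommendations: flat list of recommended item ids (duplicates across
    users allowed). The long tail is the `tail_fraction` least popular items.
    """
    by_popularity = sorted(playcounts, key=playcounts.get)  # least popular first
    cutoff = int(len(by_popularity) * tail_fraction)
    tail = set(by_popularity[:cutoff])
    if not recommendations:
        return 0.0
    return sum(1 for item in recommendations if item in tail) / len(recommendations)
```

A value near zero, as MostPopular attains in Table 3, indicates recommendations concentrated almost entirely in the top head.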
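The two Experiment 2 sampling steps, selecting the top 30% of users by maximum input preference ratio and then restricting to users with PR towards female artists above 0.6, can be sketched as follows (hypothetical helper names; a simplification that ignores the per-gender-group stratification described in the text):

```python
def top_preference_sample(users_pr, fraction=0.3):
    """Keep the top `fraction` of users ranked by their maximum input PR.

    users_pr: dict mapping user -> maximum input preference ratio.
    """
    ranked = sorted(users_pr, key=users_pr.get, reverse=True)
    return ranked[: max(1, int(len(ranked) * fraction))]

def extreme_female_sample(users_pr_female, threshold=0.6):
    """Keep only users with input PR towards female artists > threshold."""
    return [u for u, pr in users_pr_female.items() if pr > threshold]
```

In the paper, the second filter shrinks the samples to 100 users (LFM-1b) and 400 users (LFM-360k), reflecting how rare extreme preferences towards female artists are in both datasets.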
https://doi.org/10.1145/2827872 Western countries, consequently our results cannot be generalized [22] Andre Holzapfel, Bob L. Sturm, and Mark Coeckelbergh. 2018. Ethical Dimensions to represent a global scenario. This issue is well known in the MIR of Music Information Retrieval Technology. Transactions of the International domain [39], and we do believe that to consider a multicultural per- Society for Music Information Retrieval 1 (2018), 44–55. [23] Nicolas Hug. 2017. Surprise, a Python library for recommender systems. http: spective is undoubtedly a necessary step to give robustness to MIR //surpriselib.com. studies dealing with socio-cultural and socio-technical phenomena. [24] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Modeling and User-Adapted Interaction 25, 5 (2015), 427–491. https://doi.org/10.1007/s11257-015-9165-3 [25] Dietmar Jannach, Oren Sar Shalom, and Joseph A Konstan. 2019. Towards 8 ACKNOWLEDGMENTS More Impactful Recommender Systems Research. In Proceedings of the ImpactRS This work is partially supported by the European Commission Workshop, 13th ACM Conference on Recommender Systems (RecSys 2019). 15–17. under the TROMPA project (H2020 770376). [26] Gawesh Jawaheer, Martin Szomszor, and Patty Kostkova. 2010. Comparison of 1454012 implicit and explicit feedback from an online music recommendation service. In [35] Caroline Criado Perez. 2019. Invisible Women: Exposing data bias in a world Proceedings of the 1st International Workshop on Information Heterogeneity and designed for men. Random House. Fusion in Recommender Systems, HetRec 2010, Held at the 4th ACM Conference [36] Alan Said, Alejandro Bellogín Kouki, and A. P. deVries. 2013. A Top-N Recom- on Recommender Systems (RecSys 2010). 47–51. https://doi.org/10.1145/1869446. mender System Evaluation Protocol Inspired by Deployed Systems. 
1869453 [37] Justin Salamon. 2019. What’s Broken in Music Informatics Research? Three [27] Yehuda Koren. 2010. Factor in the Neighbors: Scalable and Accurate Collaborative Uncomfortable Statements. In Proceedings of the 36th International Conference on Filtering. ACM Trans. Knowl. Discov. Data 4, 1, Article 1 (Jan. 2010), 24 pages. Machine Learning. 2012–2014. https://doi.org/10.1145/1644873.1644874 [38] Markus Schedl. 2016. The LFM-1b Dataset for Music Retrieval and Recommenda- [28] Dominik Kowald, Markus Schedl, and Elisabeth Lex. 2020. The Unfairness of tion. In Proceedings of the 2016 ACM on International Conference on Multimedia Popularity Bias in Music Recommendation: A Reproducibility Study. In Advances Retrieval (New York, New York, USA) (ICMR âĂŹ16). Association for Computing in Information Retrieval, Joemon M Jose, Emine Yilmaz, João Magalhães, Pablo Machinery, New York, NY, USA, 103âĂŞ110. https://doi.org/10.1145/2911996. Castells, Nicola Ferro, Mário J Silva, and Flávio Martins (Eds.). Springer Interna- 2912004 tional Publishing, Cham, 35–42. [39] Xavier Serra, Michela Magas, Emmanouil Benetos, Magdalena Chudy, Simon [29] Kun Lin, Nasim Sonboli, Bamshad Mobasher, and Robin Burke. 2019. Crank up Dixon, Arthur Flexer, Emilia Gómez, Fabien Gouyon, Perfecto Herrera, Sergi the volume: Preference bias amplification in collaborative recommendation. In Jorda, Oscar Paytuvi, Geoffroy Peeters, Jan Schlüter, Hugues Vinet, and Gerhard CEUR Workshop Proceedings, Vol. 2440. arXiv:1909.06362 Widmer. 2013. Roadmap for Music Information ReSearch. [30] Xin Luo, Mengchu Zhou, Yunni Xia, and Qingsheng Zhu. 2014. An Efficient [40] Aaron Swartz. 2002. MusicBrainz: A Semantic Web Service. IEEE Intelligent Non-Negative Matrix-Factorization-Based Approach to Collaborative Filtering Systems 17, 1 (Jan. 2002), 76âĂŞ77. https://doi.org/10.1109/5254.988466 for Recommender Systems. 
IEEE Transactions on Industrial Informatics 10, 2 [41] Virginia Tsintzou, Evaggelia Pitoura, and Panayiotis Tsaparas. 2018. Bias Dispar- (2014), 1273–1284. ity in Recommendation Systems. CoRR abs/1811.01461 (2018). arXiv:1811.01461 [31] Masoud Mansoury, Bamshad Mobasher, Robin Burke, and Mykola Pechenizkiy. http://arxiv.org/abs/1811.01461 2019. Bias disparity in collaborative recommendation: Algorithmic evaluation [42] Sarah Myers West, Meredith Whittaker, and Kate Crawford. 2019. Discriminating and comparison. In CEUR Workshop Proceedings, Vol. 2440. arXiv:1908.00831 Systems: Gender, Race and Power in AI. AI Now Institute. https://ainowinstitute. [32] Sean M McNee, John Riedl, and Joseph A Konstan. 2006. Being Accurate is org/discriminatingsystems.html Not Enough: How Accuracy Metrics Have Hurt Recommender Systems. In CHI [43] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai Wei Chang. ’06 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’06). 2017. Men also like shopping: Reducing gender bias amplification using corpus- Association for Computing Machinery, New York, NY, USA, 1097–1101. https: level constraints. EMNLP 2017 - Conference on Empirical Methods in Natural //doi.org/10.1145/1125451.1125659 Language Processing, Proceedings (2017), 2979–2989. https://doi.org/10.18653/v1/ [33] Brett Millar. 2008. Selective hearing: Gender bias in the music preferences of d17-1323 young adults. Psychology of Music 36, 4 (2008), 429–445. https://doi.org/10.1177/ [44] Ziwei Zhu, Xia Hu, and James Caverlee. 2018. Fairness-Aware Tensor-Based 0305735607086043 Recommendation. In Proceedings of the 27th ACM International Conference on [34] Yoon Joo Park and Alexander Tuzhilin. 2008. The Long Tail of Recommender Information and Knowledge Management (Torino, Italy) (CIKM âĂŹ18). Asso- Systems and How to Leverage It. Proceedings of the 12th ACM Conference on ciation for Computing Machinery, New York, NY, USA, 1153âĂŞ1162. 
https: Recommender Systems (RecSys ’18) (2008), 11–18. https://doi.org/10.1145/1454008. //doi.org/10.1145/3269206.3271795
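As a concrete illustration of the metrics underlying Figures 2–5 and the coverage discussion, the Preference Ratio (PR), the Bias Disparity (BD) metric of Tsintzou et al. [41], and a simple catalog-coverage measure can be sketched in a few lines of Python. This is a minimal sketch under our own naming conventions, not the code used to produce the paper's results; the function and variable names are illustrative.

```python
def preference_ratio(interactions, artist_gender, gender):
    """PR(G, C): fraction of a user group's (user, artist) interactions
    that involve artists of the given gender category."""
    hits = sum(1 for _, artist in interactions if artist_gender[artist] == gender)
    return hits / len(interactions)


def bias_disparity(pr_input, pr_rec):
    """BD(G, C) = (PR_R - PR_S) / PR_S: relative change of the preference
    ratio between the input data (S) and the recommendations (R).
    BD > 0 means the recommender amplifies the group's pre-existing
    preference for the category; BD < 0 means it attenuates it."""
    return (pr_rec - pr_input) / pr_input


def catalog_coverage(recommendation_lists, catalog_size):
    """Share of the artist catalog appearing in at least one user's
    recommendation list (higher values indicate wider coverage)."""
    recommended = {artist for recs in recommendation_lists for artist in recs}
    return len(recommended) / catalog_size
```

For example, if male artists account for 80% of a user group's listening events but 90% of its recommendations, BD = (0.9 − 0.8) / 0.8 = 0.125, i.e. a 12.5% amplification of the pre-existing bias.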