Exploring Artist Gender Bias in Music Recommendation

Dougal Shakespeare1, Lorenzo Porcaro1, Emilia Gómez1,2, Carlos Castillo3
1 Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
2 Joint Research Centre, European Commission, Seville, Spain
3 Web Science and Social Computing Group, Universitat Pompeu Fabra, Barcelona, Spain
dougalian.shakespeare01@estudiant.upf.edu
{lorenzo.porcaro,emilia.gomez,carlos.castillo}@upf.edu

Proceedings of the ImpactRS Workshop at ACM RecSys '20, September 25, 2020, Virtual Event, Brazil. Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Music Recommender Systems (mRS) are designed to give personalised and meaningful recommendations of items (i.e. songs, playlists or artists) to a user base, thereby reflecting and further complementing individual users' specific music preferences. Whilst accuracy metrics have been widely applied to evaluate recommendations in mRS literature, evaluating a user's item utility from other impact-oriented perspectives, including their potential for discrimination, is still a novel evaluation practice in the music domain. In this work, we center our attention on a specific phenomenon for which we want to estimate if mRS may exacerbate its impact: gender bias. Our work presents an exploratory study, analyzing the extent to which commonly deployed state-of-the-art Collaborative Filtering (CF) algorithms may act to further increase or decrease artist gender bias. To assess group biases introduced by CF, we deploy a recently proposed metric of bias disparity on two listening event datasets: the LFM-1b dataset, and the earlier constructed Celma's dataset. Our work traces the causes of disparity to variations in input gender distributions and user-item preferences, highlighting the effect such configurations can have on users' gender bias after recommendation generation.

CCS CONCEPTS

• Social and professional topics → Socio-technical systems; Gender; • Information systems → Collaborative filtering; Recommender systems.

KEYWORDS

gender bias, bias disparity, music recommendation

1 INTRODUCTION

Impact-oriented Recommender System (RS) research is gaining attention as a novel paradigm for understanding not only how users interact with recommendations, but also for shedding light on how these interactions can influence users' behaviours in the short- and the long-term [25]. An outstanding issue when studying the possible impact of RS is the heterogeneity of evaluation procedures described in the literature. Evaluating recommender systems is a non-trivial task because of the multiple facets that a good recommendation can have, and the multiple players influencing these aspects [20]. Even if the need for going beyond evaluation in terms of accuracy metrics has been well-recognized by the RS community [32], shared practices for evaluating the impact of recommendations are still missing.

Notwithstanding, recent years have seen a rise in awareness in the scientific community about the implications of socio-technical systems' design and implementation responsible for reinforcing bias and discrimination [4, 42]. Music Information Retrieval (MIR) research is still in its early stage with regards to the analysis of the ethical dimensions and impact of music technology [19, 22, 37, 39], and several challenges still need to be tackled when approaching MIR research from a socio-technical perspective. A common issue is the availability of data, often limited in terms of size, user information or musical information, and, as in many other fields, a chronic shortage of gender-disaggregated data [35]. The difficulties in our research to retrieve the artists' gender are just one example of this limitation, as presented in Sections 3 and 4.

We center our attention on a specific phenomenon that recommender systems may exacerbate: gender bias. In its broader sense, gender discrimination is a disadvantage for a group of people based on their gender. Far from being an emerging problem, gender discrimination has its roots in cultural practices historically related with socio-political power differentials [12]. Nonetheless, the modern-day prevalence of gender discrimination is not to be understated: recent reports find the disproportionate treatment of female artists to be prevalent in the Western music industry to this day1. Whilst the cause of such treatment is multifaceted, our work traces the influence of one factor evidenced in the work of Millar [33], that is, the pre-existing gender bias of a music listener.

In this exploratory study, we assess the extent to which Collaborative Filtering (CF) algorithms commonly deployed in mRS may exacerbate pre-existing users' gender biases, thereby affecting an artist gender's exposure and proportional representation. We focus on the measurement of bias disparity in recommender systems, defined as "[...] the case where the recommender system introduces bias in the data, by amplifying existing biases and reinforcing stereotypes" [41]. Building on existing literature [29, 31, 41, 43], we first reproduce the study presented by Lin et al. [29], in which preference bias amplification in collaborative recommendation is analyzed using the MovieLens dataset [21], a dataset of user activity with a movie recommendation system. In our work, we focus on the music domain, making use of two publicly available Last.fm2 listening event datasets: 1) Celma's LFM-360k dataset [10]; 2) Schedl's LFM-1b dataset [38]. Our goal is twofold: on the one hand, reproducing and verifying whether previous results [29] hold across different datasets. On the other hand, we aim at highlighting which aspects specific to the music domain can be extracted by this analysis, connecting with existing literature on gender bias in music preferences [3, 33].

The paper is structured as follows. Section 2 provides an overview of previous works related to bias in Information Technology, focusing on gender bias, but also on how this bias has been approached in music-related fields. We then introduce the considered datasets, LFM-1b and LFM-360k, respectively in Sections 3 and 4. In Section 5, the recommendation models used and the experimental settings are presented, followed by Section 6, which details the results obtained. Lastly, in Section 7 conclusions and future work are discussed.

1 http://assets.uscannenberg.org/docs/aii-inclusion-recording-studio-2019.pdf
2 https://www.last.fm

2 RELATED WORK

The notion of bias has been extensively explored in the Information Retrieval domain [4, 5, 7, 11, 24]. Typically, metrics aim to capture relative bias (i.e. bias pre-existing in data, for example in user listening histories in LFM-1b) and algorithmic bias (i.e. how filtering algorithms can result in unfair item and user treatment) to measure disproportionate unfair treatment of a protected group.

One of the most well-studied biases in RS literature is popularity bias, with the music domain being no exception to this phenomenon [6, 10, 28]. This describes the scenario in which a few popular items are recommended frequently, while the majority of items in the long tail do not get proportional attention. Highlighted in the literature as a prominent issue for CF algorithms [1, 10, 34], Kowald et al. in [28] find that, from a user's perspective, the groups who do not favor popular items may receive worsened recommendations in terms of accuracy and calibration. Moreover, Ferraro et al. in [18] study the effect of musical styles with respect to popularity bias, showing that CF approaches increase users' exposure to popular musical styles.

Bias disparity is a metric deployed to assess bias propagation across user and item groups, measuring the deviation of the recommender output from the input preference, as detailed in Section 5.1. A first application to the RS domain was described by Tsintzou et al. [41], but the metric has recently gained more traction in its application to different domains. In Lin et al. [29], bias disparity is applied to measure the extent to which state-of-the-art CF algorithms can exacerbate pre-existing biases in the MovieLens dataset. Their findings show significant differences in bias propagation across memory- and model-based CF algorithms.

Gender treatment and issues of proportional treatment in RS have been considered in a range of literature, for which we highlight some examples. Ekstrand et al. [17] examined the gender distribution of item recommendations in the book RS domain. Results prove that commonly deployed CF models differ in the gender distributions of generated item recommendation lists, such that neighbour-based approaches are shown to proportionally reflect user-item preferences in their reading histories, whereas model-based matrix factorisation favors books whose author is of male gender. Furthermore, Ekstrand et al. in [16] study the effect of recommendation algorithms on the utility for users of different gender groups, finding differences in effectiveness across gender groups. Such work highlights that the effect in utility does not exclusively benefit large groups, implying that there may be other underlying latent factors that influence recommendation accuracy. To address such issues of disproportionate gender treatment in recommendations, Edizel et al. in [15] have recently proposed a novel means of mitigating the derivation of sensitive features (such as gender) in the latent space, using fairness constraints based on the predictability of such features. A similar approach proposing fairness-aware tensor-based recommendation is also presented by Zhu et al. in [44].

In the music domain, Aguiar et al. [2] propose a methodology to assess the extent to which artists ranked in Spotify playlists are affected by gender, after accounting for plausible determinants of inclusion on playlists such as country, song characteristics (e.g. bpm, key signature), and past streaming success. The authors find that there is some evidence consistent with the presence of bias (both for and against female artists); however, they do not draw subsequent relations between this and the disproportionately low streaming share of female artists on the platform. In the work by Anglada-Tort et al. [3], through the analysis of UK top 5 music charts between the years 1960-1995, the authors show how popular music is affected by a large gender inequality, showing the presence of an existing bias in the listening preferences towards male artists. Similarly, Millar in [33], surveying a population of Australian young adults, shows how music preferences are affected by gender bias, evidencing differences between male and female listeners. In contrast, in our work we apply an auditing strategy for bias propagation, showing under which conditions input preferences are reflected in RS output, inferring music preferences from the users' listening history grouped with respect to the artists' gender.

3 THE LFM-1B DATASET

The LFM-1b dataset consists of more than one billion listening events created by over 120,000 users of the music streaming platform Last.fm [38]. In our analysis, we consider user-artist playcounts formed by aggregating user-song listening events by common artists. We then scale the number of listens logarithmically, as done in [13, 26]. We work with a filtered version of the dataset in which: a) we remove users who listened to less than 10 unique artists, and artists listened to by less than 10 users; b) we discard users whose listening history contains more than 25% of artists with unknown gender, to mitigate the impact of artists with missing gender in the dataset.

User gender is represented in the dataset with three categories: male, female and N/A. We choose to focus only on users with self-declared gender, working with two final categories of user gender: male and female. As shown in Table 1, distributions are highly imbalanced towards men (72% of the users are men).

Artist gender is not represented in the LFM-1b dataset; consequently, we retrieve this information from the open music encyclopedia MusicBrainz3 (MB) [40]. Code repositories implementing the following approach are made openly available4, alongside the acquired results of the data wrangling5, to support reproducibility. We identify five discrete categories of gender defined in the MB database: male, female, other, N/A and undef. Artists of gender N/A and undef are differentiated as artists for which gender is not applicable and not identifiable, respectively. For bands, we compute gender counts of all members and then compute an overall classification based on whichever count has a majority. In the case of artists with gender ties (e.g. a band consisting of 2 males and 2 females), we discard such artists from our final analysis, as gender is in this instance deemed ambiguous. After applying this methodology, we are able to identify 27% of artists with a known gender. Distributions are observed to be highly imbalanced, such that artists of male gender constitute the majority (82%) of artists for which gender can be identified, as shown in Table 1.

In our final analysis, we further filter out artists not identified as male or female according to the procedure described above. Artists of gender other are discarded as we deem such data to be too sparse to be informative in the analysis of users' listening preferences. We note this group merits further future evaluation, perhaps relying on qualitative methods, and limitations of this binary approach are discussed in Section 7. Table 2 presents the top 5 artists based on the total sum of play counts in the filtered LFM-1b dataset. We observe a trend for male artists' popularity, having approximately twice as many play counts as top-rated female artists/bands. We also observe a trend for the top male artists on the platform to be more commonly composed of bands in comparison to the top-rated female artists.

3 https://musicbrainz.org/
4 https://github.com/dshakes90/LFM-1b-MusicBrainz-Gender-Wrangler
5 https://zenodo.org/record/3964506#.XyE5N0FKg5n

            LFM-1b               LFM-360k
            male      female     male      female
Users       31.4K     11.5K      94.3K     30.8K
%           71.67     28.33      75.40     24.60
Artists     127K      27.3K      50.4K     10.5K
%           82.30     17.70      82.83     17.17
Top-head    25.7K     4.8K       10.1K     1.5K
%           84.21     15.79      86.99     13.01
Long-tail   100K      22.2K      38.7K     8.5K
%           81.87     18.13      81.95     18.05

Table 1: Users' and artists' distributions after the filtering process. "Top-head" artists are the top 20% of artists by play counts, while the remaining 80% are the "long-tail."

LFM-1b
No.  Male artist    Plays   Female artist    Plays
1    Radiohead      2.6M    Lana Del Rey     1.2M
2    The Beatles    2.5M    Lady Gaga        1.1M
3    Pink Floyd     2.1M    Rihanna          0.8M
4    Daft Punk      2.0M    Björk            0.7M
5    Metallica      1.9M    Madonna          0.6M

LFM-360k
No.  Male artist    Plays   Female artist    Plays
1    Radiohead      6.2M    Björk            1.3M
2    The Beatles    5.4M    Avril Lavigne    1.1M
3    In Flames      4.9M    Madonna          1.1M
4    Metallica      4.3M    Britney Spears   0.9M
5    Muse           4.2M    Regina Spektor   0.9M

Table 2: Top 5 artists ordered by total play counts in the LFM-1b and LFM-360k datasets.

4 THE LFM-360K DATASET

The LFM-360k dataset [10] consists of approximately 360,000 users' listening histories from Last.fm collected during Fall 2008, presenting a snapshot of listening activity for an earlier period in comparison to the LFM-1b dataset. With respect to user gender distributions, the proportion of users with a self-declared gender rises to 91%, whereas, similarly to the LFM-1b dataset, artist gender is not defined. To resolve this, we implement the same pre-processing methodology with the MB database as described for the LFM-1b dataset. After further applying the filtering criteria previously detailed, we are able to identify 31% of artists with a known gender, a proportion notably higher than what we were able to identify for the LFM-1b dataset. As presented in Table 1, artist gender distributions in the filtered dataset are once again highly imbalanced towards artists classified as men. For users with identified gender, we again observe a high imbalance towards male users (75%), comparable to rates observed in the LFM-1b dataset.

When comparing the two datasets we observe several additional differences and similarities which may impact the propagation of a gender bias in artist recommendations. First, the number of users is significantly larger than that of the LFM-1b, whilst the number of artists is much smaller. Second, sparsity is higher in the LFM-360k dataset in comparison to the LFM-1b. Third, with regard to the top 5 artists of male and female gender in the dataset, we observe significantly higher playcounts for artists classified as male in comparison to the LFM-1b dataset, as shown in Table 2. With regard to similarities across the two datasets, we observe that the top 5 popular male artists are more commonly bands in comparison to the top 5 female artists. In addition, we observe that the long tail of both datasets contains a significantly higher proportion of female artists in comparison to the top head, reinforcing the conclusion that female artists are significantly more likely to be less popular on the Last.fm platform and hence more likely to be less recommended as a result of this popularity bias.

5 METHODOLOGY

5.1 Evaluation Metrics

In this section, we formally outline the metrics of preference ratio and bias disparity, as well as the accuracy and beyond-accuracy metrics considered during the evaluation.

Preference ratio (PR). Let U be the set of n users, I be the set of m items and S be the n×m input matrix, where S(u, i) = 1 if user u has selected item i, and zero otherwise. Given matrix S, the input preference ratio for user group G on item category C is the fraction of liked items by group G in category C, formally defined as the following:

    PR_S(G, C) = ( Σ_{u∈G} Σ_{i∈C} S(u, i) ) / ( Σ_{u∈G} Σ_{i∈I} S(u, i) )    (1)

Bias disparity (BD). It is defined as the relative difference between the preference bias for input S and the output of a recommendation algorithm R. Formally, we define the metric as the following:

    BD(G, C) = ( PR_R(G, C) − PR_S(G, C) ) / PR_S(G, C)    (2)

In our analysis, we generate a set of r ranked items, R_u, which have the highest predicted ratings for a given user u, limiting the value of r to 5.

Accuracy and beyond-accuracy metrics. To evaluate the RS performance, we additionally deploy two accuracy metrics, Precision and nDCG, and three beyond-accuracy metrics: coverage, spread and long-tail percentage. We refer to the metrics formulation as detailed in the work by Di Noia et al. [14]. Precision (p@n) captures the proportion of relevant items in top-n recommendations, such that relevance is a binary function that represents the relevance of item i for a user u. In our work, we consider relevant a recommendation which is greater than or equal to the average scaled listening count for a user, after discarding outliers in the data computed using the interquartile range. Although p@n is useful for analysing generated item recommendations, it does not capture accuracy aspects relating to the rank of a recommendation. Hence, in our work we also deploy the metric nDCG, a rank-sensitive metric used to evaluate the accuracy of a RS. With respect to metrics beyond accuracy, we utilise both spread and coverage to capture a recommender system's ability to recommend a broad range of unique items. Such approaches are important to consider in our work to potentially reason about and explain bias propagation across artist genders. The metric long-tail percentage is used to capture the proportion of item recommendations which exist in the long tail. In our work, we define the long tail as the 80% of least popular items in the system. We use the metric to capture a filtering algorithm's capacity to display the popularity bias.

5.2 Recommendation Algorithms

We test several commonly deployed memory- and model-based CF algorithms, following a similar approach to previous work [28, 29]. Using Surprise [23], a Python library for recommender systems, we formulate our music recommendations as a rating prediction problem where we predict the preference of a target user u for a target artist a. We then evaluate RS recommending the top-5 artists with the highest predicted preferences.

We consider two types of CF algorithms: (1) a KNN-based approach, UserKNNAvg [27], and (2) a factorisation-based approach, Non-Negative Matrix Factorization (NMF) [30]. Hyperparameters of UserKNNAvg and NMF are tuned to give the best performance we can achieve with respect to the rank-aware metric nDCG. In addition, we consider the MostPopular and UserItemAvg algorithms, which respectively recommend the most popular and the highest rated artists. We consider these algorithms for a baseline comparison.

A variation of the leave-l-out evaluation detailed in [9] is performed, whereby we translate the approach to evaluate a top-n RS. Drawing influence from the methodology of Said et al. [36], we define 3 parameters: (1) n, the size of the recommendation list generated; (2) N, the number of items selected for each user to appear in the test set, where N is constrained to be > n to allow for variance in item recommendations across tested algorithms; (3) M, the minimum number of unique artists listened to by a user, where M is constrained to be > N to ensure a non-empty test set is able to be formed for each user. We construct three folds, randomly selecting for each user N items in their listening history to belong to the fold's test set and then subsequently removing these listening events from the fold's training set. For each of the algorithms tested, we compute all the evaluation metrics and preference ratios over each fold and then subsequently report average performance. In our work we set N = 10, M = 20 and n = 5, thereby generating top-5 recommendation lists. We consider a user's test set of size N as the sample space for recommendations to be formed.

5.3 Experimental Design

We set up two experimental designs to evaluate variations in gender bias disparity across recommended artists and user groups for the two datasets. For all experiments detailed, code repositories are made openly available6. Experiment 1 is a real-world scenario in which male and female gender distributions are representative of those in both datasets. Experiment 2 is an extreme scenario in which all users have high levels of preference ratio, representing extreme listening preferences towards artists of a specific gender.

Experiment 1. We generate recommendations for a sample of all users for which gender can be identified. In the LFM-1b dataset, we limit the size of this sample to 30% of all male and female users in the whole dataset, randomly chosen (approx. 12,000 users), due to computational constraints. The size of the user sample for the LFM-360k dataset was also constrained to be approximately the same size as the sample for the LFM-1b dataset. User and artist gender distributions in both samples are representative of overall gender distributions in the entirety of both datasets. We therefore use this experiment to consider the case of gender bias propagation under a real-world scenario, assessing the extent to which gender bias disparity may differ across datasets.

Experiment 2. We generate recommendations only for a sample of male and female users which have high preference ratios in the dataset, thereby simulating an extreme scenario under which all users are highly biased towards one artist gender group in their listening preferences. For the LFM-1b dataset, we select the top 30% of both male and female user groups with the highest maximum input preference ratios, maintaining both the proportions of male and female users in the datasets and the sample size of experiment 1. For the LFM-360k dataset, we sample users from both male and female user groups maintaining the distribution of male and female users in the original dataset. The final user sample has approximately the same sample size as that of the LFM-1b user sample.

Figure 1 represents the distributions of users' input preference ratio towards male and female artist groups. For both datasets considered in this study, it shows that only around 20% of users have a preference ratio towards male artists lower than 0.8. On the contrary, 80% of users have a preference ratio lower than 0.2 towards female artists. Due to the disproportionate amount of users with extreme preferences for male artists across both datasets, the random sampling methodology proposed does little to assess extreme preference towards female artists, resulting in a situation very similar to experiment 1. To resolve this, we further limit our sample space to only users who have extreme preference for female artists, with input preference ratio towards female artists > 0.6. This results in a sample size reduction to 100 users for the LFM-1b dataset, and 400 users for the LFM-360k dataset. Although reduced in size in comparison to experiment 1, we believe such experimental designs to be fundamental to measure the extent to which the treatment of users with extreme preferences differs across artist genders. Experiment 2 represents a situation opposite to the one proposed in experiment 1, thanks to which we can assess whether bias propagation is not embedded in the gender per se, but is a result of pre-existing bias.

6 https://github.com/dshakes90/Last-fm-Gender-Bias-Analysis

Figure 1: Input Preference Ratio (PR) distributions: LFM-1b (top) and LFM-360k (bottom).

6 RESULTS

6.1 Experiment 1 - Whole population

We report in Figure 2 preference ratio, and in Figure 3 bias disparity, results obtained with the LFM-1b dataset. Figure 4 and Figure 5 present preference ratio and bias disparity results respectively for the LFM-360k dataset. The dotted lines in Figure 2 and Figure 4 represent input preference ratios, whereas the plots' bars display output preference ratios computed from generated recommendation lists. With regard to pre-existing bias, users in both datasets display high and low input preference ratios for male and female artists respectively, thereby in line with the findings of Millar [33]. In addition, for both artist genders, input preference ratios can be seen to be higher for users who share the same gender as the artist.

             MostPopular  UserItemAvg  UserKNNAvg   NMF
precision    0.010        0.595        0.676       *0.734
nDCG         0.012        0.663        0.793       *0.880
coverage     1.7E-04      0.364       *0.558        0.552
spread       2.322        11.85       *12.84        12.72
longtail %   0            0.027        0.053       *0.054

Table 3: Experiment 1 evaluation results on the LFM-1b dataset. Values in bold represent the top value, while those marked with * are results where the difference is statistically significant, according to a t-test with α = 0.05.

With regard to bias propagation after recommendation, all recommendation models tested result in a positive bias disparity for male artists, for which there is minimal variance in treatment across user genders. The popularity-based algorithm results in the highest levels of bias disparity for both male and female users, whilst the NMF and UserKNNAvg algorithms tested result in the lowest absolute levels of bias disparity, with marginal difference in bias propagation across the two algorithms. What is more, our findings show male users to be more affected by bias propagation in the LFM-1b dataset, whilst for LFM-360k we observe bias propagation to be greater for female users, thereby in line with the findings of Lin et al. [29]. With regard to bias disparity for female artists, negative levels are observed for all algorithms tested. The MostPopular algorithm results in the lowest levels of bias disparity due to female artists having significantly lower popularity in both datasets tested, as shown in Table 1. We observe bias propagation to be greater for recommendations generated using the LFM-1b dataset, reflected in the lower long-tail percentage attained. This suggests that users in the LFM-1b dataset may be more subject to a popularity bias in comparison to LFM-360k, which may translate to increased levels of gender bias disparity due to female artists proportionally residing less in the top head. Together, our findings suggest that differences in bias propagation across the two datasets may be traced to pre-existing bias entering the system in the form of listening events.

6.2 Experiment 2 - Extreme preferences

Considering users with extreme preferences for female artists, we observe the inverse scenario of experiment 1, such that bias disparity is positive for female artists and negative towards male artists, as shown in Figure 3 and Figure 5. For both datasets, we comment that one cause of such disparity is a dramatic imbalance in users' listening preference, which then subsequently propagates through to other users' recommendations. Our findings show that such bias propagation is not reserved for male artists on the platform and can, under extreme scenarios, emerge in the opposite manner. For both memory- and model-based approaches tested we observe significant differences in bias disparity: NMF results in the smallest absolute bias disparity increase, thereby reflecting a user's input preference, whereas the neighbour-based UserKNNAvg increases absolute bias disparity levels towards whichever user-artist preference is in the majority. The tendency of NMF to propagate less bias, positively or negatively speaking, in comparison to the other models is also reflected in the results obtained from the beyond-accuracy metrics evaluation. Indeed, for experiment 2 NMF achieves the highest levels of coverage, recommending wider subsets of artists, and at the same time high levels of recommendation spread. Together these results suggest that the model-based algorithm considered in this study is capable of achieving a higher level of diversification in the outcomes in comparison to the memory-based model. Translated to our scenario, it means that NMF is the algorithm that focuses less on recommending a specific gender group, avoiding the exacerbation of pre-existing bias in the dataset that other recommendation algorithms exhibit. Again, the effect of bias propagation is seen to be more amplified in the case of the LFM-1b dataset.

Figure 2: Preference Ratio (PR) results for the LFM-1b dataset for experiment 1 (left column) and experiment 2 (right column).

Figure 3: Bias Disparity (BD) results for the LFM-1b dataset for experiment 1 (left column) and experiment 2 (right column).

Figure 4: Preference Ratio (PR) results for the LFM-360k dataset for experiment 1 (left column) and experiment 2 (right column).

Figure 5: Bias Disparity (BD) results for the LFM-360k dataset for experiment 1 (left column) and experiment 2 (right column).

7 CONCLUSIONS AND FUTURE WORK

Studies of gender bias in music preferences, conducted in fields such as Music Psychology and Gender Studies, have already evidenced how socio-cultural factors are responsible for the disparate treatment of not-male artists. In the field of MIR, relatively little research has analyzed how existing technology can have a role in mitigating or amplifying this bias. In line with the studies on bias disparity in the RS literature, focusing on the musical domain we show how recommendation outcomes can actually impact gender bias in music preferences. Using a binary gender classification, where users and artists are classified as male or female, we have shown how at different levels recommender systems can propagate a pre-existing bias. In addition, simulating an "upside down" world where users have a much higher preference towards female artists, we still find evidence of an exacerbation of that bias. Our results show that gender bias can be propagated by CF-based recommendations, according to the bias present in the data. Hence, RS can have a role in propagating bias, but, at least in our exploratory study, we have not found evidence that they cause the emergence of new forms of biases.

The limitations of our work are several. First, it is important to remark that the binary classification of gender is an oversimplification of gender representation. The state-of-the-art perspective on gender from both natural and social science domains is often non-binary, where male and female are just two of the many genders with which an individual may choose to identify. Binary definitions of gender have been widely critiqued as socially constructed through routine gendered performances [8, 12]; thereby, considering gender to be only binary in this work is both limiting and, to some degree, reinforcing of such binary logic. Second, the evaluation of RS is computed such that the impact of the outcome can be understood in the short- but not in the long-term. Using longitudinal data or simulation frameworks, we believe that a better comprehension of the phenomenon can be achieved, complementing the results we have presented. Lastly, Last.fm users tend to come mostly from

REFERENCES
[1] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, and Bamshad Mobasher. 2019. The unfairness of popularity bias in recommendation. CEUR Workshop Proceedings 2440 (2019). arXiv:1907.13286
[2] Luis Aguiar, Joel Waldfogel, and Sarah Waldfogel. 2018. Playlisting Favorites: Is Spotify Gender-Biased? Technical Report, November. https://ec.europa.eu/jrc/sites/jrcsh/files/jrc113503.pdf
[3] Manuel Anglada-Tort, Amanda E. Krause, and Adrian C. North. 2019. Popular music lyrics and musicians' gender over time: A computational approach. Psychology of Music (2019). https://doi.org/10.1177/0305735619871602
[4] Ricardo Baeza-Yates. 2018. Bias on the web. Commun. ACM 61, 6 (2018), 54–61. https://doi.org/10.1145/3209581
[5] Solon Barocas and Andrew D. Selbst. 2014. Big Data's Disparate Impact. California Law Review 671 (2014), 671–732.
[6] Christine Bauer and Markus Schedl. 2019. Global and country-specific mainstreaminess measures: Definitions, analysis, and usage for improving personalized music recommendation systems. PLOS ONE (2019), 1–36.
[7] Engin Bozdag. 2013. Bias in algorithmic filtering and personalization. Ethics and Information Technology 15, 3 (2013), 209–227. https://doi.org/10.1007/s10676-013-9321-6
[8] Judith Butler. 2006. Gender Trouble. Taylor and Francis.
[9] Rocío Cañamares, Pablo Castells, and Alistair Moffat. 2020. Offline evaluation options for recommender systems. Information Retrieval Journal 23 (2020). https://doi.org/10.1007/s10791-020-09371-3
[10] Òscar Celma. 2010. Music Recommendation and Discovery: The Long Tail, Long Fail, and Long Play in the Digital Music Space. Springer-Verlag Berlin Heidelberg.
[11] Henriette Cramer, Jean Garcia-Gathright, Aaron Springer, and Sravana Reddy. 2018. Assessing and addressing algorithmic bias in practice. Interactions 25, 6 (2018), 58–63. https://doi.org/10.1145/3278156
[12] Simone de Beauvoir. 1949. The Second Sex. Vintage Classics.
[13] Sarah Dean, Sarah Rich, and Benjamin Recht. 2020. Recommendations and User Agency: The Reachability of Collaboratively-Filtered Information. In Proceedings of the 3rd ACM Conference on Fairness, Accountability and Transparency (ACM FAccT 2020). Barcelona, Spain, 436–445. https://doi.org/10.1145/3351095.3372866
[14] Tommaso Di Noia, Jessica Rosati, Paolo Tomeo, and Eugenio Di Sciascio. 2017. Adaptive multi-attribute diversity for recommender systems. Information Sciences 382-383 (2017), 234–253. https://doi.org/10.1016/j.ins.2016.11.015
[15] Bora Edizel, Francesco Bonchi, Sara Hajian, André Panisson, and Tamir Tassa. 2019. FaiRecSys: mitigating algorithmic bias in recommender systems. International Journal of Data Science and Analytics 9, 2 (2019), 197–213. https://doi.org/10.1007/s41060-019-00181-5
[16] Michael D. Ekstrand, Mucun Tian, Jennifer D. Ekstrand, Oghenemaro Anuyah, David Mcneill, and Maria Soledad Pera. 2018. All The Cool Kids, How Do They Fit In? Popularity and Demographic Biases in Recommender Evaluation and Effectiveness. In Proceedings of the 1st ACM Conference on Fairness, Accountability and Transparency (ACM FAccT 2018), Vol. 81. 172–186. https://doi.org/10.18122/B2GM6F
[17] Michael D. Ekstrand, Mucun Tian, Mohammed R. Imran Kazi, Hoda Mehrpouyan, and Daniel Kluver. 2018. Exploring Author Gender in Book Rating and Recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). 242–250. http://dl.acm.org/citation.cfm?doid=3240323.3240373
[18] Andres Ferraro, Dmitry Bogdanov, Xavier Serra, and Jason Yoon. 2019. Artist and style exposure bias in collaborative filtering based music recommendations. In 1st Workshop on Designing Human-Centric MIR Systems (wsHCMIR19), co-located at the 20th Conference of the International Society for Music Information Retrieval (ISMIR 2019). arXiv:1911.04827 http://arxiv.org/abs/1911.04827
[19] Emilia Gomez, Andre Holzapfel, Marius Miron, and Bob L. Sturm. 2019. Fairness, Accountability and Transparency in Music Information Research (FAT-MIR). https://doi.org/10.5281/zenodo.3546227
[20] Asela Gunawardana and Guy Shani. 2015. Evaluating Recommender Systems. Springer US, Boston, MA, 265–308. https://doi.org/10.1007/978-1-4899-7637-6_8
[21] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems 5, 4 (2015), 1–19.
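The band-gender rule described in Section 3 (a majority vote over members' genders, discarding ties) can be sketched as follows. This is our own illustration of the stated rule, not the authors' released code; the function and variable names are hypothetical.

```python
from collections import Counter
from typing import List, Optional

def classify_band_gender(member_genders: List[str]) -> Optional[str]:
    """Majority vote over band members' genders ("male"/"female").

    Returns the majority gender, or None when no member has a usable
    gender or the counts tie (e.g. 2 males and 2 females), in which
    case the artist is deemed ambiguous and discarded.
    """
    counts = Counter(g for g in member_genders if g in ("male", "female"))
    if not counts:
        return None
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie: ambiguous, discard from the analysis
    return top[0][0]
```

Under this sketch, a trio with two male members and one female member is classified as male, while an evenly split quartet is dropped.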
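Equations (1) and (2) can be computed directly from a binary interaction structure. The following is a minimal sketch in plain Python (our own helper names, with S represented as a user-to-items mapping rather than a matrix):

```python
def preference_ratio(S, group, category):
    """PR_S(G, C): the fraction of group G's selections falling in category C.

    S: dict mapping each user to the set of items they selected,
    group: iterable of users (G), category: set of items (C).
    """
    in_category = sum(len(S[u] & category) for u in group)
    total = sum(len(S[u]) for u in group)
    return in_category / total if total else 0.0

def bias_disparity(pr_input, pr_output):
    """BD(G, C): relative change of the preference ratio after recommendation."""
    return (pr_output - pr_input) / pr_input

# Toy example: two users, items a-d, where {a, b} are (say) male artists.
S = {"u1": {"a", "b", "c"}, "u2": {"a", "d"}}
male_artists = {"a", "b"}
pr_in = preference_ratio(S, ["u1", "u2"], male_artists)  # 3 of 5 selections
```

If the recommendations R then yield an output preference ratio of 0.8 for the same group and category, BD = (0.8 − 0.6) / 0.6 ≈ 0.33, i.e. the recommender amplified the input bias by a third.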
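The per-user leave-N-out fold construction with the n, N and M constraints described in Section 5.2 can be sketched as below. This is a hedged reconstruction of the stated procedure, not the authors' code; names are our own.

```python
import random

def make_folds(history, n_folds=3, N=10, M=20, seed=0):
    """Per-user leave-N-out splits (sketch of the paper's protocol).

    history: dict mapping user -> list of unique artist ids listened to.
    Users with fewer than M unique artists are excluded; in each fold,
    N randomly chosen artists form the user's test set and the remaining
    listening events form the training set (M > N keeps it non-empty).
    """
    rng = random.Random(seed)
    folds = []
    for _ in range(n_folds):
        fold = {}
        for user, artists in history.items():
            if len(artists) < M:
                continue  # below the minimum-profile threshold M
            test = set(rng.sample(artists, N))
            train = [a for a in artists if a not in test]
            fold[user] = {"train": train, "test": test}
        folds.append(fold)
    return folds
```

With N = 10, M = 20 and n = 5 as in the paper, each retained user contributes a 10-item test set per fold, from which top-5 lists are evaluated.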
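The long-tail percentage metric, with the long tail defined as the 80% least popular items, can be sketched as follows (our own illustration, assuming total play counts determine popularity):

```python
def long_tail_percentage(recommendations, playcounts, tail_fraction=0.8):
    """Share of recommended items that belong to the long tail.

    playcounts: dict mapping item -> total play count;
    recommendations: flat list of recommended item ids (duplicates across
    users allowed). The long tail is the `tail_fraction` least popular items.
    """
    by_popularity = sorted(playcounts, key=playcounts.get)  # least popular first
    cutoff = int(len(by_popularity) * tail_fraction)
    tail = set(by_popularity[:cutoff])
    if not recommendations:
        return 0.0
    return sum(1 for item in recommendations if item in tail) / len(recommendations)
```

A value near zero, as MostPopular attains in Table 3, indicates recommendations concentrated almost entirely in the top head.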
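The two Experiment 2 sampling steps, selecting the top 30% of users by maximum input preference ratio and then restricting to users with PR towards female artists above 0.6, can be sketched as follows (hypothetical helper names; a simplification that ignores the per-gender-group stratification described in the text):

```python
def top_preference_sample(users_pr, fraction=0.3):
    """Keep the top `fraction` of users ranked by their maximum input PR.

    users_pr: dict mapping user -> maximum input preference ratio.
    """
    ranked = sorted(users_pr, key=users_pr.get, reverse=True)
    return ranked[: max(1, int(len(ranked) * fraction))]

def extreme_female_sample(users_pr_female, threshold=0.6):
    """Keep only users with input PR towards female artists > threshold."""
    return [u for u, pr in users_pr_female.items() if pr > threshold]
```

In the paper, the second filter shrinks the samples to 100 users (LFM-1b) and 400 users (LFM-360k), reflecting how rare extreme preferences towards female artists are in both datasets.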
https://doi.org/10.1145/2827872 Western countries, consequently our results cannot be generalized [22] Andre Holzapfel, Bob L. Sturm, and Mark Coeckelbergh. 2018. Ethical Dimensions to represent a global scenario. This issue is well known in the MIR of Music Information Retrieval Technology. Transactions of the International domain [39], and we do believe that to consider a multicultural per- Society for Music Information Retrieval 1 (2018), 44–55. [23] Nicolas Hug. 2017. Surprise, a Python library for recommender systems. http: spective is undoubtedly a necessary step to give robustness to MIR //surpriselib.com. studies dealing with socio-cultural and socio-technical phenomena. [24] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Modeling and User-Adapted Interaction 25, 5 (2015), 427–491. https://doi.org/10.1007/s11257-015-9165-3 [25] Dietmar Jannach, Oren Sar Shalom, and Joseph A Konstan. 2019. Towards 8 ACKNOWLEDGMENTS More Impactful Recommender Systems Research. In Proceedings of the ImpactRS This work is partially supported by the European Commission Workshop, 13th ACM Conference on Recommender Systems (RecSys 2019). 15–17. under the TROMPA project (H2020 770376). [26] Gawesh Jawaheer, Martin Szomszor, and Patty Kostkova. 2010. Comparison of 1454012 implicit and explicit feedback from an online music recommendation service. In [35] Caroline Criado Perez. 2019. Invisible Women: Exposing data bias in a world Proceedings of the 1st International Workshop on Information Heterogeneity and designed for men. Random House. Fusion in Recommender Systems, HetRec 2010, Held at the 4th ACM Conference [36] Alan Said, Alejandro Bellogín Kouki, and A. P. deVries. 2013. A Top-N Recom- on Recommender Systems (RecSys 2010). 47–51. https://doi.org/10.1145/1869446. mender System Evaluation Protocol Inspired by Deployed Systems. 
1869453 [37] Justin Salamon. 2019. What’s Broken in Music Informatics Research? Three [27] Yehuda Koren. 2010. Factor in the Neighbors: Scalable and Accurate Collaborative Uncomfortable Statements. In Proceedings of the 36th International Conference on Filtering. ACM Trans. Knowl. Discov. Data 4, 1, Article 1 (Jan. 2010), 24 pages. Machine Learning. 2012–2014. https://doi.org/10.1145/1644873.1644874 [38] Markus Schedl. 2016. The LFM-1b Dataset for Music Retrieval and Recommenda- [28] Dominik Kowald, Markus Schedl, and Elisabeth Lex. 2020. The Unfairness of tion. In Proceedings of the 2016 ACM on International Conference on Multimedia Popularity Bias in Music Recommendation: A Reproducibility Study. In Advances Retrieval (New York, New York, USA) (ICMR âĂŹ16). Association for Computing in Information Retrieval, Joemon M Jose, Emine Yilmaz, João Magalhães, Pablo Machinery, New York, NY, USA, 103âĂŞ110. https://doi.org/10.1145/2911996. Castells, Nicola Ferro, Mário J Silva, and Flávio Martins (Eds.). Springer Interna- 2912004 tional Publishing, Cham, 35–42. [39] Xavier Serra, Michela Magas, Emmanouil Benetos, Magdalena Chudy, Simon [29] Kun Lin, Nasim Sonboli, Bamshad Mobasher, and Robin Burke. 2019. Crank up Dixon, Arthur Flexer, Emilia Gómez, Fabien Gouyon, Perfecto Herrera, Sergi the volume: Preference bias amplification in collaborative recommendation. In Jorda, Oscar Paytuvi, Geoffroy Peeters, Jan Schlüter, Hugues Vinet, and Gerhard CEUR Workshop Proceedings, Vol. 2440. arXiv:1909.06362 Widmer. 2013. Roadmap for Music Information ReSearch. [30] Xin Luo, Mengchu Zhou, Yunni Xia, and Qingsheng Zhu. 2014. An Efficient [40] Aaron Swartz. 2002. MusicBrainz: A Semantic Web Service. IEEE Intelligent Non-Negative Matrix-Factorization-Based Approach to Collaborative Filtering Systems 17, 1 (Jan. 2002), 76âĂŞ77. https://doi.org/10.1109/5254.988466 for Recommender Systems. 
IEEE Transactions on Industrial Informatics 10, 2 [41] Virginia Tsintzou, Evaggelia Pitoura, and Panayiotis Tsaparas. 2018. Bias Dispar- (2014), 1273–1284. ity in Recommendation Systems. CoRR abs/1811.01461 (2018). arXiv:1811.01461 [31] Masoud Mansoury, Bamshad Mobasher, Robin Burke, and Mykola Pechenizkiy. http://arxiv.org/abs/1811.01461 2019. Bias disparity in collaborative recommendation: Algorithmic evaluation [42] Sarah Myers West, Meredith Whittaker, and Kate Crawford. 2019. Discriminating and comparison. In CEUR Workshop Proceedings, Vol. 2440. arXiv:1908.00831 Systems: Gender, Race and Power in AI. AI Now Institute. https://ainowinstitute. [32] Sean M McNee, John Riedl, and Joseph A Konstan. 2006. Being Accurate is org/discriminatingsystems.html Not Enough: How Accuracy Metrics Have Hurt Recommender Systems. In CHI [43] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai Wei Chang. ’06 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’06). 2017. Men also like shopping: Reducing gender bias amplification using corpus- Association for Computing Machinery, New York, NY, USA, 1097–1101. https: level constraints. EMNLP 2017 - Conference on Empirical Methods in Natural //doi.org/10.1145/1125451.1125659 Language Processing, Proceedings (2017), 2979–2989. https://doi.org/10.18653/v1/ [33] Brett Millar. 2008. Selective hearing: Gender bias in the music preferences of d17-1323 young adults. Psychology of Music 36, 4 (2008), 429–445. https://doi.org/10.1177/ [44] Ziwei Zhu, Xia Hu, and James Caverlee. 2018. Fairness-Aware Tensor-Based 0305735607086043 Recommendation. In Proceedings of the 27th ACM International Conference on [34] Yoon Joo Park and Alexander Tuzhilin. 2008. The Long Tail of Recommender Information and Knowledge Management (Torino, Italy) (CIKM âĂŹ18). Asso- Systems and How to Leverage It. Proceedings of the 12th ACM Conference on ciation for Computing Machinery, New York, NY, USA, 1153âĂŞ1162. 
https: Recommender Systems (RecSys ’18) (2008), 11–18. https://doi.org/10.1145/1454008. //doi.org/10.1145/3269206.3271795
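As a concrete illustration of the metrics underlying Figures 2–5 and the coverage discussion, the Preference Ratio (PR), the Bias Disparity (BD) metric of Tsintzou et al. [41], and a simple catalog-coverage measure can be sketched in a few lines of Python. This is a minimal sketch under our own naming conventions, not the code used to produce the paper's results; the function and variable names are illustrative.

```python
def preference_ratio(interactions, artist_gender, gender):
    """PR(G, C): fraction of a user group's (user, artist) interactions
    that involve artists of the given gender category."""
    hits = sum(1 for _, artist in interactions if artist_gender[artist] == gender)
    return hits / len(interactions)


def bias_disparity(pr_input, pr_rec):
    """BD(G, C) = (PR_R - PR_S) / PR_S: relative change of the preference
    ratio between the input data (S) and the recommendations (R).
    BD > 0 means the recommender amplifies the group's pre-existing
    preference for the category; BD < 0 means it attenuates it."""
    return (pr_rec - pr_input) / pr_input


def catalog_coverage(recommendation_lists, catalog_size):
    """Share of the artist catalog appearing in at least one user's
    recommendation list (higher values indicate wider coverage)."""
    recommended = {artist for recs in recommendation_lists for artist in recs}
    return len(recommended) / catalog_size
```

For example, if male artists account for 80% of a user group's listening events but 90% of its recommendations, BD = (0.9 − 0.8) / 0.8 = 0.125, i.e. a 12.5% amplification of the pre-existing bias.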