Recommender Systems Fairness Evaluation via Generalized Cross Entropy*

Yashar Deldjoo (Polytechnic University of Bari, Italy) yashar.deldjoo@poliba.it
Vito Walter Anelli (Polytechnic University of Bari, Italy) vitowalter.anelli@poliba.it
Hamed Zamani (University of Massachusetts Amherst, USA) zamani@umass.edu
Alejandro Bellogín (Autonomous University of Madrid, Spain) alejandro.bellogin@uam.es
Tommaso Di Noia (Polytechnic University of Bari, Italy) tommaso.dinoia@poliba.it

* Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the RMSE workshop held in conjunction with the 13th ACM Conference on Recommender Systems (RecSys), 2019, in Copenhagen, Denmark.

ABSTRACT
Fairness in recommender systems has been considered with respect to sensitive attributes of users (e.g., gender, race) or items (e.g., revenue in a multistakeholder setting). Regardless, the concept has commonly been interpreted as some form of equality, i.e., the degree to which the system is meeting the information needs of all its users in an equal sense. In this paper, we argue that fairness in recommender systems does not necessarily imply equality, but instead it should consider a distribution of resources based on merits and needs. We present a probabilistic framework based on generalized cross entropy to evaluate fairness of recommender systems under this perspective, and we show that the proposed framework is flexible and explanatory because it allows incorporating domain knowledge (through an ideal fair distribution) that can help to understand which item or user aspects a recommendation algorithm is over- or under-representing. Results on two real-world datasets show the merits of the proposed evaluation framework both in terms of user and item fairness.

KEYWORDS
recommender systems; fairness; metric; generalized cross entropy; evaluation

1 INTRODUCTION AND CONTEXT
Recommender systems (RS) are widely applied across the modern Internet, in e-commerce websites, movie and music streaming platforms, or on social media, to point users to items (products or services) [16]. For the evaluation of RS, accuracy metrics are typically employed, which measure how much the presented items will be of interest to the target user. One commonly raised concern is how fair the recommendations produced by RS are. For example, do users of a certain gender or race receive fair utility (i.e., benefit) from the recommendation service? To answer this question, one has to recognize the multiple stakeholders involved in such systems and the fact that fairness issues can be studied for more than one group of participants [8]. In a job recommendation scenario, for instance, these multiple groups can be the job seekers and the prospective employers, and fairness toward both parties has to be recognized. Moreover, fairness in RS can be measured with respect to items or users; in this context, user and item fairness are commonly associated with an equal chance of appearing in the recommendation results (items) or of receiving results of the same quality (users). As an example of the latter, an unfair system may discriminate against users of a particular race or gender.

One common characteristic of the previous literature focusing on RS fairness evaluation is that fairness has commonly been interpreted as some form of equality across multiple groups (e.g., gender, race). For example, Ekstrand et al. [16] studied whether RS produce equal utility for users of different demographic groups; the authors find demographic differences in measured effectiveness across two datasets from different domains. Yao and Huang [28] studied various types of unfairness that can occur in collaborative filtering models and, to produce fair recommendations, proposed to penalize algorithms producing disparate distributions of prediction error. For additional resources see [8, 9, 17, 30, 31].

Nonetheless, although less common, there are a few works where fairness has been defined beyond uniformity. For instance, in [5], the authors proposed an approach focused on mining the relation between relevance and attention in Information Retrieval by exploiting the positional bias of search results. That work promotes the notion that ranked subjects should receive attention that is proportional to their worthiness in a given search scenario, and it achieves fairness of attention by making exposure proportional to relevance. Similarly, a framework formulation of fairness constraints on rankings in terms of exposure allocation is presented in [26], both with respect to group fairness constraints and individuals. Another approach where non-uniform fairness has been used is the work proposed in [29], where the authors aim to solve the top-k ranking problem by optimizing a fair utility function under two conditions: in-group monotonicity (i.e., rank more relevant items above less relevant ones within the group) and group fairness (the proportion of protected group items in the top-k ranking should be above a minimum threshold). In summary, even though these approaches use some notion of non-uniformity, they are applied under different perspectives and purposes.

In the present work, we argue that fairness does not necessarily imply equality between groups, but instead a proper distribution of utility (benefits) based on merits and needs. To this end, we present a probabilistic framework for evaluating RS fairness based on attributes of any nature (e.g., sensitive or insensitive) for both items and users, and we show that the proposed framework is flexible enough to measure fairness in RS by considering fairness as equality or non-equality among groups, as specified by the system designer or any other parties involved in a multistakeholder setting. As we shall see later, the discussed approaches differ from our proposal in that we are able to accommodate different notions of fairness based not only on ranking but also on rating and even beyond-accuracy metrics. In fact, the main advantage of our framework is to provide the system designer with a high degree of flexibility in defining fairness from multiple viewpoints. Results on two real-world datasets show the merits of the proposed evaluation framework, both in terms of user and item fairness.
2 EVALUATING FAIRNESS IN RS
In this section, we propose a framework based on generalized cross entropy for evaluating fairness in recommender systems. Let U and I denote a set of users and items, respectively, and let A be a set of sensitive attributes for which fairness is desired. Each attribute can be defined for either users, e.g., gender and race, or items, e.g., item provider (or stakeholder).

The goal is to find an unfairness measure I that produces a non-negative real number for a recommender system. A recommender system M is considered less unfair (i.e., more fair) than M' with respect to the attribute a ∈ A if and only if |I(M, a)| < |I(M', a)|. Previous works have used inequality measures to evaluate algorithmic unfairness; however, we argue that fairness does not always imply equality. For instance, let us assume that there are two types of users in the system, regular (free registration) and premium (paid), and the goal is to compute fairness with respect to the users' subscription type. In this example, it might be more fair to produce better recommendations for paid users; therefore, equality is not always equivalent to fairness. We define fairness of a recommender system as the generalized cross entropy (GCE) for some parameter α ≠ 0, 1:

    I(M, x) = \frac{1}{\alpha(1-\alpha)} \left[ \int p_f^{\alpha}(x) \, p^{(1-\alpha)}(x) \, dx - 1 \right]    (1)

where p and p_f respectively denote the probability distribution of the system performance and the fair probability distribution, both with respect to the attribute x = a [6]. The unfairness measure I is minimized with respect to attribute x = a when p = p_f, meaning that the performance of the system is equal to the performance of a fair system. In the next sections, we discuss how to obtain or estimate these two probability distributions. If the attribute a is discrete or categorical (as typical attributes are, such as gender or race), then the unfairness measure is defined as:

    I(M, a) = \frac{1}{\alpha(1-\alpha)} \left[ \sum_{j} p_f^{\alpha}(a_j) \, p^{(1-\alpha)}(a_j) - 1 \right]    (2)
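To make Eq. (2) concrete, the following is a minimal sketch (ours, not the authors' code) of how the discrete GCE could be computed for a categorical attribute; the function name and the example distributions are assumptions for illustration, and the paper reports the absolute value of this quantity, with 0 indicating a completely fair system.

```python
import numpy as np

def gce(p_fair, p_perf, alpha):
    """Generalized cross entropy between a fair distribution p_f and the
    system performance distribution p over attribute categories (Eq. 2)."""
    p_fair = np.asarray(p_fair, dtype=float)
    p_perf = np.asarray(p_perf, dtype=float)
    assert alpha not in (0.0, 1.0), "alpha must differ from 0 and 1"
    return (np.sum(p_fair**alpha * p_perf**(1.0 - alpha)) - 1.0) / (alpha * (1.0 - alpha))

# Illustrative values only (not taken from the paper's tables).
print(gce([0.5, 0.5], [1/3, 2/3], alpha=-1))   # nonzero: performance deviates from the fair target
print(gce([1/3, 2/3], [1/3, 2/3], alpha=-1))   # 0.0: performance matches the fair target exactly
```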
2.1 Fair Distribution p_f
The definition of a fair distribution p_f is problem-specific and should be determined based on the problem or target scenario at hand. For example, a job or music recommendation website may want to ensure that its premium users, who pay for their subscription, receive more relevant recommendations. In this case, p_f should be non-uniform across the user classes (premium versus free users). In other scenarios, a uniform definition of p_f might be desired. Generally, when fairness is equivalent to equality, then p_f should be uniform, and in that case the generalized cross entropy would be the same as the generalized entropy (see [27] for more information).

2.2 Estimating Performance Distribution p
The performance distribution p should be estimated based on the output of the recommender system on a test set. In the following, we explain how we can compute this distribution for item attributes. We define the recommendation gain (rg_i) for each item i as follows:

    rg_i = \sum_{u \in U} \phi(i, Rec_u^K) \cdot g(u, i, r)    (3)

where Rec_u^K is the set of top-K items recommended by the system to the user u ∈ U; \phi(i, Rec_u^K) = 1 if item i is present in Rec_u^K, and \phi(i, Rec_u^K) = 0 otherwise. The function g(u, i, r) is the gain of recommending item i to user u at rank r. Such a gain function can be defined in different ways. In its simplest form, when g(u, i, r) = 1, the recommendation gain in Eq. (3) boils down to a recommendation count (i.e., rg_i = rc_i). A binary gain, in which g(u, i, r) = 1 when item i recommended to user u is relevant and g(u, i, r) = 0 otherwise, is another simple form of the gain function, based on relevance. The gain function g can also be defined based on ranking information, i.e., recommending relevant items to users at higher ranks is given a higher gain. In such a case, we recommend the discounted cumulative gain (DCG) function that is widely used in the definition of NDCG [20], given by

    g(u, i, r) = \frac{2^{rel(u,i)} - 1}{\log_2(r + 1)}

where rel(u, i) denotes the relevance label for the given user-item pair u and i. We can further normalize the above formula based on the ideal DCG for user u to compute the gain function g.

The performance probability distribution p is then proportional to the recommendation gain of the items associated with an attribute value a_j. Formally, the performance probability p(a_j) used in Eq. (2) is computed as p(a_j) = \sum_{i \in a_j} rg_i / Z, where Z is a normalization factor set equal to Z = \sum_i rg_i to make sure that \sum_j p(a_j) = 1. Under an analogous formulation, we could define a variation of fairness for users based on Eq. (3):

    rg_u = \sum_{i \in I} \phi(i, Rec_u^K) \cdot g(u, i, r)    (4)

where, in this case, the gain function cannot be reduced to 1; otherwise, all users would receive the same recommendation gain rg_u.
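As an illustration of Eq. (3) and of the normalization p(a_j) = \sum_{i \in a_j} rg_i / Z, here is a minimal sketch under our own assumptions about the data layout (a dict of ranked top-K lists and a set of relevant user-item pairs); helper names such as recommendation_gain and performance_distribution are ours, not the paper's.

```python
import math
from collections import defaultdict

def recommendation_gain(topk_lists, relevant, gain="binary"):
    """rg_i (Eq. 3): accumulate, over all users, the gain of item i whenever
    it appears in that user's top-K list.
    topk_lists: {user: [item, ...]} ordered by rank (rank 1 first)
    relevant:   set of (user, item) pairs judged relevant in the test set."""
    rg = defaultdict(float)
    for user, items in topk_lists.items():
        for rank, item in enumerate(items, start=1):
            if gain == "count":
                g = 1.0                                      # rg_i becomes a plain recommendation count
            elif gain == "binary":
                g = 1.0 if (user, item) in relevant else 0.0
            else:                                            # DCG-style gain with binary relevance labels
                rel = 1.0 if (user, item) in relevant else 0.0
                g = (2.0**rel - 1.0) / math.log2(rank + 1)
            rg[item] += g
    return rg

def performance_distribution(rg, item_attribute, categories):
    """p(a_j): recommendation gain aggregated per attribute value and normalized by Z."""
    totals = {c: 0.0 for c in categories}
    for item, g in rg.items():
        totals[item_attribute[item]] += g
    z = sum(totals.values()) or 1.0                          # normalization factor Z
    return [totals[c] / z for c in categories]
```

The resulting vector can be passed, together with a chosen p_f, to the gce sketch above; the user-side variant of Eq. (4) follows the same pattern with the roles of users and items swapped.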
3 TOY EXAMPLE
For the illustration of the proposed concept, in Table 1 we provide a toy example of how our fairness evaluation framework could be applied in a real recommendation setting. A set of six users belonging to two groups (each group is associated with an attribute value a1 (red) or a2 (green)), who are interacting with a set of items, is shown in Table 1. Let us assume the red group represents users with a regular (free registration) subscription type on an e-commerce website, while the green group represents users with a premium (paid) subscription type. The recommendations produced by the different systems (Rec0, Rec1, and Rec2) are shown in the last columns. The goal is to compute fairness using the proposed fairness evaluation metric based on GCE given by Eq. (2). The results of the evaluation using three different evaluation metrics are shown in Table 2. The metrics used for the evaluation of fairness and accuracy of the system include: (i) GCE (absolute value), (ii) Precision@3, and (iii) Recall@3. Note that GCE = 0 means the system is completely fair, and the closer the value is to zero, the more fair the respective system is.

Table 1: A set of 6 users belonging to groups g1 and g2 and 10 items (i1-i10), with the number of items relevant to each user (✓) and the items recommended by Rec 0, Rec 1, and Rec 2. Rec 0 produces 3 and 6 relevant items for free and premium users (in total), respectively; Rec 1 generates 1 relevant item for each user; Rec 2 produces recommended items that are all relevant for all users.

Group | User   | Relevant items (✓) | Rec 0         | Rec 1          | Rec 2
a1    | user 1 | 3                  | {i1, i6, i8}  | {i1, i5, i9}   | {i1, i3, i7}
a1    | user 2 | 2                  | {i2, i5, i9}  | {i2, i5, i7}   | {i1, i5, i8}
a1    | user 3 | 2                  | {i1, i6, i7}  | {i2, i5, i9}   | {i2, i7, i9}
a2    | user 4 | 3                  | {i3, i4, i9}  | {i4, i5, i6}   | {i3, i4, i9}
a2    | user 5 | 3                  | {i1, i5, i7}  | {i1, i2, i10}  | {i5, i7, i10}
a2    | user 6 | 4                  | {i2, i6, i9}  | {i1, i5, i8}   | {i3, i6, i9}

Table 2: Fairness of the different recommenders in the toy example presented in Table 1 according to the proposed GCE and individual-level accuracy metrics. Note that pf0 = [1/2, 1/2], pf1 = [2/3, 1/3], and pf2 = [1/3, 2/3] characterize the fair distribution as a uniform or non-uniform distribution between the two groups.

       | GCE(pf, p, α = -1)        | P@3 | R@3
       | pf0     pf1     pf2       |     |
Rec 0  | 0.0800  0.3025  0.0025    | 1/2 | 0.530
Rec 1  | 0       0.0625  0.0625    | 1/3 | 0.375
Rec 2  | 0.0078  0.1182  0.0244    | 1   | 0.958

By looking at the recommendation results of Rec0, one can note that if fairness is defined in a uniform way between the two groups, that is, through the fair distribution pf0 = [1/2, 1/2], then Rec0 is not a completely fair system, since GCE = 0.08 ≠ 0. In contrast, if fairness is defined as providing recommendations of higher utility (usefulness) to green users, who are the users with a paid premium membership type (e.g., by setting pf2 = [1/3, 2/3]), then, since GCE ≈ 0, we can say that the recommendations produced by Rec0 are fair. Both of the above conclusions are drawn with respect to the attribute "subscription type" (with categories free/paid premium membership). This is an interesting insight which shows that the evaluation framework is flexible enough to capture fairness based on the interest of the system designer, who defines what she considers a fair recommendation through the definition of pf. While in many application scenarios we may define fairness as equality among different classes (e.g., gender, race), in some scenarios (such as those where the target attribute is not sensitive, e.g., regular vs. premium users) fairness may not be equivalent to equality.

Furthermore, by comparing the performance results of Rec1 and Rec2, we observe that, even though precision and recall improve for Rec2, which becomes the most accurate recommendation list, it fails to keep a decent amount of fairness under any parameter setting of GCE, as in both cases it is outperformed by the other methods. Moreover, GCE never reaches the optimal value, which in this case is attributed to the unequal distribution of resources among classes, since there are more relevant items for green than for red users. This evidences that optimizing an algorithm to produce relevant recommendations does not necessarily result in more fair recommendations; rather, conversely, a trade-off between the two evaluation properties can be noticed.
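For instance, reusing the gce sketch above with a purely illustrative performance distribution (these numbers are ours and are not meant to reproduce Table 2, whose values depend on the exact gain computation used by the authors), the same system can look unfair under a uniform target and fair under a premium-favoring target:

```python
p = [1/3, 2/3]                        # hypothetical share of recommendation gain: red vs. green group
print(gce([0.5, 0.5], p, alpha=-1))   # nonzero: unfair if the designer expects equality (pf0)
print(gce([1/3, 2/3], p, alpha=-1))   # 0.0: fair if the designer wants premium users favored (pf2)
```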
4 EXPERIMENTS AND RESULTS
In this section, we discuss our experimental setup and the results.

4.1 Data Descriptions
We conduct experiments on two real-world datasets, the Xing job recommendation dataset [2] and the Amazon review dataset [1]. The datasets represent different item recommendation scenarios for the job and e-commerce domains. We used the Xing dataset to study the item-related notion of fairness, while Amazon is used to study the user-related notion of fairness.

Xing Job Recommendation Dataset (Xing-REC 17): The dataset was first released by XING as part of the ACM RecSys Challenge 2017 for a job recommendation task [2]. The dataset contains 320M interactions that happened over 3 months. The reason for choosing this dataset is that it provides several user-related attributes, such as membership type (regular vs. premium), education degree, and working country, that can be useful for the study of fairness. For example, membership type allows us to study the non-equal (non-uniform) notion of fairness, as a recruiter may want to ensure that premium users obtain better quality in their recommendations.

Amazon: We used the Toys and Games subset, which contains 53K preference scores by 1K users for 24K items, with a sparsity of 99.8%. We wanted the training set to be as close as possible to an on-line real scenario in which the recommender system is deployed; with this goal in mind, we used a time-aware splitting. The most rigorous one would be the fixed-timestamp splitting method [10, 18]. In these experiments, however, we adopted the methodology proposed in [4], where a single timestamp is chosen which represents the moment when test users are on the platform waiting for recommendations. The training set corresponds to the past interactions, and the performance is evaluated with data corresponding to future interactions. The splitting timestamp is selected to maximize the number of users involved in the evaluation according to two constraints: the training set should retain at least 15 ratings, and the test set should contain at least 5 ratings.
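The single-timestamp selection described above can be sketched as follows; this is our own naive illustration of the stated constraints (at least 15 training ratings and 5 test ratings per evaluation user), not the code used in the paper.

```python
from collections import defaultdict

def choose_split_timestamp(ratings, min_train=15, min_test=5):
    """ratings: iterable of (user, item, rating, timestamp) tuples.
    Return the timestamp that maximizes the number of users having at least
    `min_train` interactions before it and `min_test` interactions from it onward."""
    by_user = defaultdict(list)
    for user, _, _, ts in ratings:
        by_user[user].append(ts)
    candidates = sorted({ts for times in by_user.values() for ts in times})
    best_ts, best_count = None, -1
    for ts in candidates:                                    # naive scan, fine for a sketch
        count = sum(
            1 for times in by_user.values()
            if sum(t < ts for t in times) >= min_train and sum(t >= ts for t in times) >= min_test
        )
        if count > best_count:
            best_ts, best_count = ts, count
    return best_ts

# Train on interactions strictly before best_ts; evaluate on interactions from best_ts onward.
```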
4.2 Experimental Setup
Two recommendation scenarios are considered to evaluate the effectiveness of the proposed fairness evaluation framework with respect to the item-centric and user-centric notions of fairness [8].

Item fairness evaluation: This scenario applies the proposed fairness evaluation metrics based on GCE to the winner of the ACM RecSys Challenge 2017. The challenge was formulated as "given a job posting, recommend a list of candidates that are suitable candidates for the job". As such, the user candidates are considered as target items for recommendation. In order to compute GCE, we used Eq. (3) considering the simplified case g(u, i, r) = 1, in which the recommendation gain rg_i boils down to the recommendation count rc_i for item i, i.e., the number of times each user appears in the recommendation lists of all jobs. We compare two recommendation approaches, the winner submission and a random submission, and evaluate the systems' fairness from the perspective of the users' membership type, education, and location. As for membership type, premium users (or paid members) are expected to receive a better quality of recommendation.

User fairness evaluation: Here, we experiment with the more traditional item recommendation task, where we study the user fairness dimension. We consider a scenario where a business owner may want to ensure superior recommendation quality for its more engaged users over less engaged (or new) users (or vice versa). In order to get a more intuitive sense of how fairly different recommendation models are recommending to users of different classes, we study the fairness of different CF recommendation models with respect to the users' interactions, defined in 4 categories: (i) very inactive (VIA), (ii) slightly inactive (SIA), (iii) slightly active (SA), and (iv) very active (VA). For each user, we compute the score n_R(u) that corresponds to the total number of ratings provided by user u, and we group the users into four groups according to the quartile that this score belongs to. We have experimented with several recommendation models, such as UserKNN [7], ItemKNN [25] (considering binarized and cosine similarity metrics, the Jaccard coefficient [15], and Pearson correlation [19]), SVD++ [21, 22], BPRMF [22, 24], BPRSlim [23], and two non-personalized models, most-popular and a random recommender.

For comparison with the proposed GCE metric, we include two complementary baseline metrics based on the absolute deviation between the mean ratings of different groups, as defined in [31]:

    MAD(R^{(i)}, R^{(j)}) = \left| \frac{\sum R^{(i)}}{|R^{(i)}|} - \frac{\sum R^{(j)}}{|R^{(j)}|} \right|

where R^{(i)} denotes the predicted ratings for all user-item combinations in group i and |R^{(i)}| is its size. Larger values of MAD mean larger differences between the groups, interpreted as unfairness. Given that our proposed GCE in user-fairness evaluation is based on NDCG, we adapt this definition to also compare the average NDCG of each group. We refer to these two baselines as MAD-rating and MAD-ranking. Finally, the reported MAD corresponds to the average MAD over all the pairwise combinations of the groups involved, i.e., MAD = avg_{i,j}(MAD(R^{(i)}, R^{(j)})).
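Below is a minimal sketch, under our own assumptions about the data layout, of the quartile-based user grouping and of the MAD baseline averaged over all group pairs; group_users_by_activity and average_mad are hypothetical helper names, not taken from the paper.

```python
import numpy as np
from itertools import combinations

def group_users_by_activity(n_ratings_per_user):
    """Assign each user to VIA/SIA/SA/VA according to the quartile of n_R(u)."""
    users = list(n_ratings_per_user)
    counts = np.array([n_ratings_per_user[u] for u in users], dtype=float)
    q1, q2, q3 = np.percentile(counts, [25, 50, 75])
    def label(c):
        return "VIA" if c <= q1 else "SIA" if c <= q2 else "SA" if c <= q3 else "VA"
    return {u: label(c) for u, c in zip(users, counts)}

def average_mad(values_per_group):
    """Average pairwise absolute deviation between per-group means (predicted
    ratings for MAD-rating, per-user NDCG for MAD-ranking)."""
    means = {g: float(np.mean(v)) for g, v in values_per_group.items()}
    pairs = list(combinations(means, 2))
    return sum(abs(means[i] - means[j]) for i, j in pairs) / len(pairs)
```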
4.3 Results and Discussion
We start our analysis with the results for the item fairness evaluation as described in Section 4.2, presented in Tables 3 and 4. The counts in these tables represent the total number of users of a given category that each submission recommends. We observe in Table 3 that the recommendations produced by the RecSys Challenge winner perform better with pf0 than with pf2, since the GCE value is closer to 0. This evidences that the winner system produces balanced recommendations across the two membership classes. This is in contrast to our expectation that premium users should be provided better recommendations. Therefore, even though the winning submission could produce higher recommendation quality from a global perspective, it does not comply with our expectation of a fair recommendation for this attribute, which is to provide better recommendations to premium users.

Table 3: Results of applying the proposed fairness evaluation metrics to the Xing-REC 17 winner submission to identify item-centered fairness for the attribute membership type (regular vs. premium). Note that in this case it is desired to increase the utility of the recommendations for premium (paid) users. pf0 = [1/2, 1/2] (uniform) vs. pf2 = [1/3, 2/3] (non-uniform); GCE is computed with α = 2.

           | Membership type        | GCE(pf, p, α = 2)
           | regular     premium    | pf0      pf2
RSC Winner | 4,108,771   547,029    | 0.2926   0.6786
Random     | 4,209,878   445,759    | 0.3269   0.7335

Furthermore, in Table 4 we present the recommendation fairness evaluation results using GCE across two other attributes, Country and Education; each of these attributes takes 4 categories. We define five variations of the fair distribution pf: while pf0 considers all attribute categories equally important, the others give one attribute category a higher importance compared to the rest. After applying the GCE on the winner submission, we observe that, with respect to the Country attribute, the lowest value of GCE (best case) is produced for the German category (GCE = 0.061), while for the Education attribute the category Unknown (GCE = 0.0974) produces the best outcome; in both cases, these categories are the most frequently recommended by the analyzed submission. These results show that, for a given target application, if the system is looking for candidates with a certain nationality (in this case, German) or education level (here, any), the recommendations coming from the winner submission are closer to a fair system. In fact, due to the inherent biases in the dataset, the random submission obtains better results according to our definition of fairness for several of the fair distributions analyzed. However, it is worth mentioning that if the system designer wants to promote those users with a BSc or PhD, the GCE would show that the winner submission provides better recommendations to those users than the random submission. To the best of our knowledge, this is the first fairness-oriented evaluation metric that allows capturing these nuances, which, as a consequence, helps in understanding how the recommendation algorithms work on each user group.

Table 4: Results of applying the proposed fairness evaluation metrics to the Xing-REC 17 winner submission to identify item-centered fairness for the attributes Country and Education. The associated fair probability distributions are pf0 = [0.25, 0.25, 0.25, 0.25], pf1 = [0.7, 0.1, 0.1, 0.1], pf2 = [0.1, 0.7, 0.1, 0.1], pf3 = [0.1, 0.1, 0.7, 0.1], and pf4 = [0.1, 0.1, 0.1, 0.7], where pf0 defines the fair distribution as uniform while the rest define it as favoring one of the groups. GCE is computed with α = 2.

Country   | German  Austrian  Swiss   Other   | pf0     pf1     pf2    pf3    pf4
winner    | 3.9M    156.1K    329.4K  186.4K  | -0.979  0.061   3.194  3.177  3.192
random    | 3.8M    253.6K    319.6K  239.8K  | -0.883  0.038   2.945  2.937  2.946
Education | NA      BSc       MSc     PhD     | pf0     pf1     pf2    pf3    pf4
winner    | 2.8M    607.2K    1.0M    158.3K  | 0.398   0.0974  1.673  1.547  1.741
random    | 3.0M    428.4K    1.0M    203.4K  | 0.450   0.0887  1.838  1.670  1.866

Table 5 shows the results for the proposed user fairness evaluation as described in Section 4.2. We observe in this table that each recommender obtains a GCE value in a different range, an obvious consequence of the different performance obtained in each case for the different groups (as we observe in the NDCG@10 columns for each user type). For instance, BPRMF is the one found by GCE to perform in a fair way when assuming uniformity with respect to the user groups (pf0); however, if the system designer aims to promote those recommenders that provide better suggestions to the most active users, then Random, followed by ItemKNN and SVD++, are the most fair algorithms.

Table 5: Results of applying the proposed fairness evaluation metrics on the Amazon dataset to identify user-centered fairness. The fair probability distributions are defined as pfi so that pfi(j) = 0.1 when j ≠ i and 0.7 otherwise, except for pf0, which denotes the uniform distribution, i.e., just as in Table 4. Types of users (VIA/SIA/SA/VA) as defined in Section 4.2. Best values per column in bold.

            | NDCG@10                        | GCE(pf, p, α = -1)                      | MAD
            | VIA     SIA     SA      VA     | pf0     pf1     pf2     pf3     pf4     | rating  ranking
Random      | 0.0000  0.0000  0.0000  0.0005 | 1.5000  4.5000  4.5000  4.5000  0.2143  | 0.0000  0.0003
MostPopular | 0.0000  0.0006  0.0013  0.0014 | 0.2435  1.3586  1.2289  0.6714  0.5825  | 0.1864  0.0008
ItemKNN     | 0.0023  0.0021  0.0016  0.0036 | 0.0487  0.6218  0.6722  0.7537  0.2636  | 0.0254  0.0011
UserKNN     | 0.0031  0.0040  0.0037  0.0053 | 0.0214  0.6483  0.5379  0.5783  0.3319  | 0.0375  0.0012
BPRMF       | 0.0022  0.0025  0.0028  0.0016 | 0.0191  0.5496  0.4767  0.3881  0.6642  | 0.2078  0.0006
BPRSlim     | 0.0027  0.0023  0.0035  0.0017 | 0.0353  0.5377  0.6150  0.3267  0.7267  | 9.0009  0.0010
SVD++       | 0.0025  0.0025  0.0025  0.0042 | 0.0324  0.6336  0.6382  0.6361  0.2750  | 0.0027  0.0009

Comparing MAD against GCE, we observe that MAD-ranking produces lower values when the NDCG values of the classes are close to each other (e.g., in the case of the Random recommender), which corresponds to the already discussed notion of fairness as equality/uniformity; similarly, MAD-rating obtains better results for the random algorithm because, as expected, such a method has no inherent bias with respect to the defined user groups, but also for SVD++, probably because this recommender tends to predict ratings in a small range. In both cases, it becomes evident that MAD, in contrast to our proposed GCE metric, cannot incorporate other definitions of fairness in its computation; hence, its flexibility is very limited.

In summary, we have shown that our proposed fairness evaluation metric is able to unveil whether a recommendation algorithm satisfies our definition of fairness, where we argue that it should emphasize a proper distribution of utility based on merits and needs. We demonstrate this for both notions of fairness: based on users and based on items. Therefore, we conclude that this metric could help better explain the results of the algorithms towards specific groups of users and items and, as a consequence, it could increase the transparency of recommender systems evaluation.
5 CONCLUSION
Fairness-aware recommendation research requires appropriate evaluation metrics to quantify fairness. Furthermore, fairness in RS can be associated with either items or users, even though this complementary view has been underrepresented in the literature. In this work, we have presented a probabilistic framework to measure fairness of RS under the perspective of users and items. Experimental results on two real-world datasets show the merits of the proposed evaluation framework. In particular, one of the key aspects of our proposed evaluation metric is its transparency and flexibility, since it allows incorporating domain knowledge (by means of an ideal fair distribution) that helps in understanding which item or user aspects the recommendation algorithms are over- or under-representing.

In the future, we plan to exploit the proposed fairness- and relevance-aware evaluation system to build recommender systems that directly optimize for this objective criterion. It is also of our interest to study various notions of fairness of recommendation under different content-based filtering or CF models using item content as side information [11, 12] and on different domains (e.g., tourism [3], entertainment [14], social recommendation, among others). Finally, we are considering investigating the robustness of CF models against shilling attacks [13] crafted to undermine not only the accuracy of recommendations but also the fairness of these models.

6 ACKNOWLEDGEMENTS
This work was supported in part by the Center for Intelligent Information Retrieval and in part by project TIN2016-80630-P (MINECO). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

REFERENCES
[1] Amazon product data. http://jmcauley.ucsd.edu/data/amazon/. 2014 (accessed June 5, 2019).
[2] Fabian Abel, Yashar Deldjoo, Mehdi Elahi, and Daniel Kohlsdorf. 2017. RecSys Challenge 2017: Offline and Online Evaluation. In Proc. RecSys '17. ACM, New York, NY, USA, 372-373.
[3] Jens Adamczak, Gerard-Paul Leyson, Peter Knees, Yashar Deldjoo, Farshad Bakhshandegan Moghaddam, Julia Neidhardt, Wolfgang Wörndl, and Philipp Monreal. 2019. Session-Based Hotel Recommendations: Challenges and Future Directions. arXiv preprint arXiv:1908.00071 (2019).
[4] Vito Walter Anelli, Tommaso Di Noia, Eugenio Di Sciascio, Azzurra Ragone, and Joseph Trotta. 2019. Local Popularity and Time in top-N Recommendation. In Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14-18, 2019, Proceedings, Part I. 861-868.
[5] Asia J. Biega, Krishna P. Gummadi, and Gerhard Weikum. 2018. Equity of Attention: Amortizing Individual Fairness in Rankings. In Proc. SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018. ACM, 405-414.
[6] Zdravko I. Botev and Dirk P. Kroese. 2011. The generalized cross entropy method, with applications to probability density estimation. Methodology and Computing in Applied Probability 13, 1 (2011), 1-27.
[7] John S. Breese, David Heckerman, and Carl Myers Kadie. 1998. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In UAI '98: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, Wisconsin, USA, July 24-26, 1998. Morgan Kaufmann, 43-52.
[8] Robin Burke. 2017. Multisided fairness for recommendation. arXiv preprint arXiv:1707.00093 (2017).
[9] Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. 2018. Balanced Neighborhoods for Multi-sided Fairness in Recommendation. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA (Proceedings of Machine Learning Research), Vol. 81. PMLR, 202-214.
[10] Pedro G. Campos, Fernando Díez, and Iván Cantador. 2014. Time-aware recommender systems: a comprehensive survey and analysis of existing evaluation protocols. User Model. User-Adapt. Interact. 24, 1-2 (2014), 67-119.
[11] Yashar Deldjoo, Mihai Gabriel Constantin, Hamid Eghbal-Zadeh, Bogdan Ionescu, Markus Schedl, and Paolo Cremonesi. 2018. Audio-visual encoding of multimedia content for enhancing movie recommendations. In Proc. RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018. ACM, 455-459.
[12] Yashar Deldjoo, Maurizio Ferrari Dacrema, Mihai Gabriel Constantin, Hamid Eghbal-zadeh, Stefano Cereda, Markus Schedl, Bogdan Ionescu, and Paolo Cremonesi. 2019. Movie genome: alleviating new item cold start in movie recommendation. User Model. User-Adapt. Interact. 29, 2 (2019), 291-343.
[13] Yashar Deldjoo, Tommaso Di Noia, and Felice Antonio Merra. 2019. Assessing the Impact of a User-Item Collaborative Attack on Class of Users. In Workshop on the Impact of Recommender Systems (ImpactRecSys '19) at the 13th ACM Conference on Recommender Systems.
[14] Yashar Deldjoo, Markus Schedl, Paolo Cremonesi, and Gabriella Pasi. 2018. Content-Based Multimedia Recommendation Systems: Definition and Application Domains. In Proceedings of the 9th Italian Information Retrieval Workshop, Rome, Italy, May 28-30, 2018 (CEUR Workshop Proceedings), Vol. 2140.
[15] Wei Dong, Charikar Moses, and Kai Li. 2011. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web. ACM, 577-586.
[16] Michael D. Ekstrand, Mucun Tian, Ion Madrazo Azpiazu, Jennifer D. Ekstrand, Oghenemaro Anuyah, David McNeill, and Maria Soledad Pera. 2018. All The Cool Kids, How Do They Fit In?: Popularity and Demographic Biases in Recommender Evaluation and Effectiveness. In Conference on Fairness, Accountability and Transparency. 172-186.
[17] Golnoosh Farnadi, Pigi Kouki, Spencer K. Thompson, Sriram Srinivasan, and Lise Getoor. 2018. A Fairness-aware Hybrid Recommender System. CoRR abs/1809.09030 (2018). arXiv:1809.09030
[18] Asela Gunawardana and Guy Shani. 2015. Evaluating Recommender Systems. In Recommender Systems Handbook. Springer, 265-308.
[19] Jonathan L. Herlocker, Joseph A. Konstan, and John Riedl. 2002. An Empirical Analysis of Design Choices in Neighborhood-Based Collaborative Filtering Algorithms. Inf. Retr. 5, 4 (2002), 287-310.
[20] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422-446.
[21] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008. ACM, 426-434.
[22] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30-37.
[23] Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011. IEEE Computer Society, 497-506.
[24] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009. AUAI Press, 452-461.
[25] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2000. Analysis of recommendation algorithms for e-commerce. In Proceedings of the 2nd ACM Conference on Electronic Commerce. ACM, 158-167.
[26] Ashudeep Singh and Thorsten Joachims. 2018. Fairness of Exposure in Rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018. ACM, 2219-2228.
[27] Till Speicher, Hoda Heidari, Nina Grgic-Hlaca, Krishna P. Gummadi, Adish Singla, Adrian Weller, and Muhammad Bilal Zafar. 2018. A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual & Group Unfairness via Inequality Indices. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2239-2248.
[28] Sirui Yao and Bert Huang. 2017. Beyond parity: Fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems. 2921-2930.
[29] Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo A. Baeza-Yates. 2017. FA*IR: A Fair Top-k Ranking Algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06-10, 2017. ACM, 1569-1578.
[30] Yong Zheng, Tanaya Dave, Neha Mishra, and Harshit Kumar. 2018. Fairness In Reciprocal Recommendations: A Speed-Dating Study. In Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization, UMAP 2018, Singapore, July 08-11, 2018. ACM, 29-34.
[31] Ziwei Zhu, Xia Hu, and James Caverlee. 2018. Fairness-Aware Tensor-Based Recommendation. In Proc. CIKM 2018, Torino, Italy, October 22-26, 2018. ACM, 1153-1162.