Recommender Systems Fairness Evaluation via Generalized Cross Entropy*

Yashar Deldjoo (Polytechnic University of Bari, Italy) yashar.deldjoo@poliba.it
Vito Walter Anelli (Polytechnic University of Bari, Italy) vitowalter.anelli@poliba.it
Hamed Zamani (University of Massachusetts Amherst, USA) zamani@umass.edu
Alejandro Bellogín (Autonomous University of Madrid, Spain) alejandro.bellogin@uam.es
Tommaso Di Noia (Polytechnic University of Bari, Italy) tommaso.dinoia@poliba.it

* Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the RMSE workshop held in conjunction with the 13th ACM Conference on Recommender Systems (RecSys), 2019, in Copenhagen, Denmark.

ABSTRACT
Fairness in recommender systems has been considered with respect to sensitive attributes of users (e.g., gender, race) or items (e.g., revenue in a multistakeholder setting). Regardless, the concept has commonly been interpreted as some form of equality, i.e., the degree to which the system is meeting the information needs of all its users in an equal sense. In this paper, we argue that fairness in recommender systems does not necessarily imply equality, but instead it should consider a distribution of resources based on merits and needs. We present a probabilistic framework based on generalized cross entropy to evaluate fairness of recommender systems under this perspective, and we show that the proposed framework is flexible and explanatory because it allows incorporating domain knowledge (through an ideal fair distribution) that can help to understand which item or user aspects a recommendation algorithm is over- or under-representing. Results on two real-world datasets show the merits of the proposed evaluation framework both in terms of user and item fairness.

KEYWORDS
recommender systems; fairness; metric; generalized cross entropy; evaluation

1 INTRODUCTION AND CONTEXT
Recommender systems (RS) are widely applied across the modern Internet, in e-commerce websites, movie and music streaming platforms, or on social media, to point users to items (products or services) [16]. For the evaluation of RS, accuracy metrics are typically employed, which measure how much the presented items will be of interest to the target user. One commonly raised concern is how fair the recommendations produced by RS are. For example, do users of a certain gender or race receive fair utility (i.e., benefit) from the recommendation service? To answer this question, one has to recognize the multiple stakeholders involved in such systems and the fact that fairness issues can be studied for more than one group of participants [8]. In a job recommendation scenario, for instance, these multiple groups can be the job seekers and the prospective employers, and fairness toward both parties has to be recognized. Moreover, fairness in RS can be measured with respect to items or users; in this context, user and item fairness are commonly associated with an equal chance of appearing in the recommendation results (items) or of receiving results of the same quality (users). As an example of the latter, an unfair system may discriminate against users of a particular race or gender.

One common characteristic of the previous literature focusing on RS fairness evaluation is that fairness has commonly been interpreted as some form of equality across multiple groups (e.g., gender, race). For example, Ekstrand et al. [16] studied whether RS produce equal utility for users of different demographic groups; the authors find demographic differences in measured effectiveness across two datasets from different domains. Yao and Huang [28] studied various types of unfairness that can occur in collaborative filtering models and, to produce fair recommendations, proposed to penalize algorithms producing disparate distributions of prediction error. For additional resources see [8, 9, 17, 30, 31].

Nonetheless, although less common, there are a few works where fairness has been defined beyond uniformity. For instance, in [5], the authors proposed an approach focused on mining the relation between relevance and attention in Information Retrieval by exploiting the positional bias of search results. That work promotes the notion that ranked subjects should receive attention that is proportional to their worthiness in a given search scenario, and it achieves fairness of attention by making exposure proportional to relevance. Similarly, a framework formulation of fairness constraints on rankings in terms of exposure allocation is presented in [26], both with respect to group fairness constraints and individuals. Another approach where non-uniform fairness has been used is the work proposed in [29], where the authors aim to solve the top-k ranking problem by optimizing a fair utility function under two conditions: in-group monotonicity (i.e., rank more relevant items above less relevant ones within the group) and group fairness (the proportion of protected group items in the top-k ranking should be above a minimum threshold). In summary, even though these approaches use some notion of non-uniformity, they are applied under different perspectives and purposes.

In the present work, we argue that fairness does not necessarily imply equality between groups, but instead a proper distribution of utility (benefits) based on merits and needs. To this end, we present a probabilistic framework for evaluating RS fairness based on attributes of any nature (e.g., sensitive or insensitive) for both items and users, and we show that the proposed framework is flexible enough to measure fairness in RS by considering fairness as equality or non-equality among groups, as specified by the system designer or any other parties involved in a multistakeholder setting. As we shall see later, the discussed approaches differ from our proposal in that we are able to accommodate different notions of fairness based not only on ranking but also on rating and even beyond-accuracy metrics. In fact, the main advantage of our framework is to provide the system designer with a high degree of flexibility in defining fairness from multiple viewpoints. Results on two real-world datasets show the merits of the proposed evaluation framework, both in terms of user and item fairness.
2 EVALUATING FAIRNESS IN RS
In this section, we propose a framework based on generalized cross entropy for evaluating fairness in recommender systems. Let U and I denote a set of users and items, respectively, and let A be a set of sensitive attributes for which fairness is desired. Each attribute can be defined for either users, e.g., gender and race, or items, e.g., item provider (or stakeholder).

The goal is to find an unfairness measure I that produces a non-negative real number for a recommender system. A recommender system M is considered less unfair (i.e., more fair) than M' with respect to the attribute a ∈ A if and only if |I(M, a)| < |I(M', a)|. Previous works have used inequality measures to evaluate algorithmic unfairness; however, we argue that fairness does not always imply equality. For instance, let us assume that there are two types of users in the system, regular (free registration) and premium (paid), and the goal is to compute fairness with respect to the users' subscription type. In this example, it might be more fair to produce better recommendations for paid users; therefore, equality is not always equivalent to fairness. We define fairness of a recommender system as the generalized cross entropy (GCE) for some parameter α ≠ 0, 1:

    I(M, x) = \frac{1}{\alpha(1-\alpha)} \left[ \int p_f^{\alpha}(x) \, p^{(1-\alpha)}(x) \, dx - 1 \right]    (1)

where p and p_f respectively denote the probability distribution of the system performance and the fair probability distribution, both with respect to the attribute x = a [6]. The unfairness measure I is minimized with respect to attribute x = a when p = p_f, meaning that the performance of the system is equal to the performance of a fair system. In the next sections, we discuss how to obtain or estimate these two probability distributions. If the attribute a is discrete or categorical (as typical attributes are, such as gender or race), then the unfairness measure is defined as:

    I(M, a) = \frac{1}{\alpha(1-\alpha)} \left[ \sum_{j} p_f^{\alpha}(a_j) \, p^{(1-\alpha)}(a_j) - 1 \right]    (2)
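To make Eq. (2) concrete, the following is a minimal sketch (ours, not the authors' code) of how the discrete GCE could be computed for a categorical attribute; the function name and the example distributions are assumptions for illustration, and the paper reports the absolute value of this quantity, with 0 indicating a completely fair system.

```python
import numpy as np

def gce(p_fair, p_perf, alpha):
    """Generalized cross entropy between a fair distribution p_f and the
    system performance distribution p over attribute categories (Eq. 2)."""
    p_fair = np.asarray(p_fair, dtype=float)
    p_perf = np.asarray(p_perf, dtype=float)
    assert alpha not in (0.0, 1.0), "alpha must differ from 0 and 1"
    return (np.sum(p_fair**alpha * p_perf**(1.0 - alpha)) - 1.0) / (alpha * (1.0 - alpha))

# Illustrative values only (not taken from the paper's tables).
print(gce([0.5, 0.5], [1/3, 2/3], alpha=-1))   # nonzero: performance deviates from the fair target
print(gce([1/3, 2/3], [1/3, 2/3], alpha=-1))   # 0.0: performance matches the fair target exactly
```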
2.1 Fair Distribution p_f
The definition of a fair distribution p_f is problem-specific and should be determined based on the problem or target scenario at hand. For example, a job or music recommendation website may want to ensure that its premium users, who pay for their subscription, receive more relevant recommendations. In this case, p_f should be non-uniform across the user classes (premium versus free users). In other scenarios, a uniform definition of p_f might be desired. Generally, when fairness is equivalent to equality, then p_f should be uniform, and in that case the generalized cross entropy would be the same as the generalized entropy (see [27] for more information).

2.2 Estimating Performance Distribution p
The performance distribution p should be estimated based on the output of the recommender system on a test set. In the following, we explain how we can compute this distribution for item attributes. We define the recommendation gain (rg_i) for each item i as follows:

    rg_i = \sum_{u \in U} \phi(i, Rec_u^K) \cdot g(u, i, r)    (3)

where Rec_u^K is the set of top-K items recommended by the system to the user u ∈ U; \phi(i, Rec_u^K) = 1 if item i is present in Rec_u^K, and \phi(i, Rec_u^K) = 0 otherwise. The function g(u, i, r) is the gain of recommending item i to user u at rank r. Such a gain function can be defined in different ways. In its simplest form, when g(u, i, r) = 1, the recommendation gain in Eq. (3) boils down to a recommendation count (i.e., rg_i = rc_i). A binary gain, in which g(u, i, r) = 1 when item i recommended to user u is relevant and g(u, i, r) = 0 otherwise, is another simple form of the gain function, based on relevance. The gain function g can also be defined based on ranking information, i.e., recommending relevant items to users at higher ranks is given a higher gain. In such a case, we recommend the discounted cumulative gain (DCG) function that is widely used in the definition of NDCG [20], given by

    g(u, i, r) = \frac{2^{rel(u,i)} - 1}{\log_2(r + 1)}

where rel(u, i) denotes the relevance label for the given user-item pair u and i. We can further normalize the above formula based on the ideal DCG for user u to compute the gain function g.

The performance probability distribution p is then proportional to the recommendation gain of the items associated with an attribute value a_j. Formally, the performance probability p(a_j) used in Eq. (2) is computed as p(a_j) = \sum_{i \in a_j} rg_i / Z, where Z is a normalization factor set equal to Z = \sum_i rg_i to make sure that \sum_j p(a_j) = 1. Under an analogous formulation, we could define a variation of fairness for users based on Eq. (3):

    rg_u = \sum_{i \in I} \phi(i, Rec_u^K) \cdot g(u, i, r)    (4)

where, in this case, the gain function cannot be reduced to 1; otherwise, all users would receive the same recommendation gain rg_u.
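As an illustration of Eq. (3) and of the normalization p(a_j) = \sum_{i \in a_j} rg_i / Z, here is a minimal sketch under our own assumptions about the data layout (a dict of ranked top-K lists and a set of relevant user-item pairs); helper names such as recommendation_gain and performance_distribution are ours, not the paper's.

```python
import math
from collections import defaultdict

def recommendation_gain(topk_lists, relevant, gain="binary"):
    """rg_i (Eq. 3): accumulate, over all users, the gain of item i whenever
    it appears in that user's top-K list.
    topk_lists: {user: [item, ...]} ordered by rank (rank 1 first)
    relevant:   set of (user, item) pairs judged relevant in the test set."""
    rg = defaultdict(float)
    for user, items in topk_lists.items():
        for rank, item in enumerate(items, start=1):
            if gain == "count":
                g = 1.0                                      # rg_i becomes a plain recommendation count
            elif gain == "binary":
                g = 1.0 if (user, item) in relevant else 0.0
            else:                                            # DCG-style gain with binary relevance labels
                rel = 1.0 if (user, item) in relevant else 0.0
                g = (2.0**rel - 1.0) / math.log2(rank + 1)
            rg[item] += g
    return rg

def performance_distribution(rg, item_attribute, categories):
    """p(a_j): recommendation gain aggregated per attribute value and normalized by Z."""
    totals = {c: 0.0 for c in categories}
    for item, g in rg.items():
        totals[item_attribute[item]] += g
    z = sum(totals.values()) or 1.0                          # normalization factor Z
    return [totals[c] / z for c in categories]
```

The resulting vector can be passed, together with a chosen p_f, to the gce sketch above; the user-side variant of Eq. (4) follows the same pattern with the roles of users and items swapped.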
3 TOY EXAMPLE
For the illustration of the proposed concept, in Table 1 we provide a toy example of how our fairness evaluation framework could be applied in a real recommendation setting. A set of six users belonging to two groups (each group is associated with an attribute value a1 (red) or a2 (green)), who are interacting with a set of items, is shown in Table 1. Let us assume the red group represents users with a regular (free registration) subscription type on an e-commerce website, while the green group represents users with a premium (paid) subscription type. The recommendations produced by the different systems (Rec0, Rec1, and Rec2) are shown in the last columns. The goal is to compute fairness using the proposed fairness evaluation metric based on GCE given by Eq. (2). The results of the evaluation using three different evaluation metrics are shown in Table 2. The metrics used for the evaluation of fairness and accuracy of the system include: (i) GCE (absolute value), (ii) Precision@3, and (iii) Recall@3. Note that GCE = 0 means the system is completely fair, and the closer the value is to zero, the more fair the respective system is.

Table 1: A set of 6 users belonging to groups g1 and g2 and 10 items (i1-i10), with the number of items relevant to each user (✓) and the items recommended by Rec 0, Rec 1, and Rec 2. Rec 0 produces 3 and 6 relevant items for free and premium users (in total), respectively; Rec 1 generates 1 relevant item for each user; Rec 2 produces recommended items that are all relevant for all users.

Group | User   | Relevant items (✓) | Rec 0         | Rec 1          | Rec 2
a1    | user 1 | 3                  | {i1, i6, i8}  | {i1, i5, i9}   | {i1, i3, i7}
a1    | user 2 | 2                  | {i2, i5, i9}  | {i2, i5, i7}   | {i1, i5, i8}
a1    | user 3 | 2                  | {i1, i6, i7}  | {i2, i5, i9}   | {i2, i7, i9}
a2    | user 4 | 3                  | {i3, i4, i9}  | {i4, i5, i6}   | {i3, i4, i9}
a2    | user 5 | 3                  | {i1, i5, i7}  | {i1, i2, i10}  | {i5, i7, i10}
a2    | user 6 | 4                  | {i2, i6, i9}  | {i1, i5, i8}   | {i3, i6, i9}

Table 2: Fairness of the different recommenders in the toy example presented in Table 1 according to the proposed GCE and individual-level accuracy metrics. Note that pf0 = [1/2, 1/2], pf1 = [2/3, 1/3], and pf2 = [1/3, 2/3] characterize the fair distribution as a uniform or non-uniform distribution between the two groups.

       | GCE(pf, p, α = -1)        | P@3 | R@3
       | pf0     pf1     pf2       |     |
Rec 0  | 0.0800  0.3025  0.0025    | 1/2 | 0.530
Rec 1  | 0       0.0625  0.0625    | 1/3 | 0.375
Rec 2  | 0.0078  0.1182  0.0244    | 1   | 0.958

By looking at the recommendation results of Rec0, one can note that if fairness is defined in a uniform way between the two groups, that is, through the fair distribution pf0 = [1/2, 1/2], then Rec0 is not a completely fair system, since GCE = 0.08 ≠ 0. In contrast, if fairness is defined as providing recommendations of higher utility (usefulness) to green users, who are the users with a paid premium membership type (e.g., by setting pf2 = [1/3, 2/3]), then, since GCE ≈ 0, we can say that the recommendations produced by Rec0 are fair. Both of the above conclusions are drawn with respect to the attribute "subscription type" (with categories free/paid premium membership). This is an interesting insight which shows that the evaluation framework is flexible enough to capture fairness based on the interest of the system designer, who defines what she considers a fair recommendation through the definition of pf. While in many application scenarios we may define fairness as equality among different classes (e.g., gender, race), in some scenarios (such as those where the target attribute is not sensitive, e.g., regular vs. premium users) fairness may not be equivalent to equality.

Furthermore, by comparing the performance results of Rec1 and Rec2, we observe that, even though precision and recall improve for Rec2, which becomes the most accurate recommendation list, it fails to keep a decent amount of fairness under any parameter setting of GCE, as in both cases it is outperformed by the other methods. Moreover, GCE never reaches the optimal value, which in this case is attributed to the unequal distribution of resources among classes, since there are more relevant items for green than for red users. This evidences that optimizing an algorithm to produce relevant recommendations does not necessarily result in more fair recommendations; rather, conversely, a trade-off between the two evaluation properties can be noticed.
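For instance, reusing the gce sketch above with a purely illustrative performance distribution (these numbers are ours and are not meant to reproduce Table 2, whose values depend on the exact gain computation used by the authors), the same system can look unfair under a uniform target and fair under a premium-favoring target:

```python
p = [1/3, 2/3]                        # hypothetical share of recommendation gain: red vs. green group
print(gce([0.5, 0.5], p, alpha=-1))   # nonzero: unfair if the designer expects equality (pf0)
print(gce([1/3, 2/3], p, alpha=-1))   # 0.0: fair if the designer wants premium users favored (pf2)
```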
4 EXPERIMENTS AND RESULTS
In this section, we discuss our experimental setup and the results.

4.1 Data Descriptions
We conduct experiments on two real-world datasets, the Xing job recommendation dataset [2] and the Amazon review dataset [1]. The datasets represent different item recommendation scenarios for the job and e-commerce domains. We used the Xing dataset to study the item-related notion of fairness, while Amazon is used to study the user-related notion of fairness.

Xing Job Recommendation Dataset (Xing-REC 17): The dataset was first released by XING as part of the ACM RecSys Challenge 2017 for a job recommendation task [2]. The dataset contains 320M interactions that happened over 3 months. The reason for choosing this dataset is that it provides several user-related attributes, such as membership type (regular vs. premium), education degree, and working country, that can be useful for the study of fairness. For example, membership type allows us to study the non-equal (non-uniform) notion of fairness, as a recruiter may want to ensure that premium users obtain better quality in their recommendations.

Amazon: We used the Toys and Games subset, which contains 53K preference scores by 1K users for 24K items, with a sparsity of 99.8%. We wanted the training set to be as close as possible to an on-line real scenario in which the recommender system is deployed; with this goal in mind, we used a time-aware splitting. The most rigorous one would be the fixed-timestamp splitting method [10, 18]. In these experiments, however, we adopted the methodology proposed in [4], where a single timestamp is chosen which represents the moment when test users are on the platform waiting for recommendations. The training set corresponds to the past interactions, and the performance is evaluated with data corresponding to future interactions. The splitting timestamp is selected to maximize the number of users involved in the evaluation according to two constraints: the training set should retain at least 15 ratings, and the test set should contain at least 5 ratings.
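The single-timestamp selection described above can be sketched as follows; this is our own naive illustration of the stated constraints (at least 15 training ratings and 5 test ratings per evaluation user), not the code used in the paper.

```python
from collections import defaultdict

def choose_split_timestamp(ratings, min_train=15, min_test=5):
    """ratings: iterable of (user, item, rating, timestamp) tuples.
    Return the timestamp that maximizes the number of users having at least
    `min_train` interactions before it and `min_test` interactions from it onward."""
    by_user = defaultdict(list)
    for user, _, _, ts in ratings:
        by_user[user].append(ts)
    candidates = sorted({ts for times in by_user.values() for ts in times})
    best_ts, best_count = None, -1
    for ts in candidates:                                    # naive scan, fine for a sketch
        count = sum(
            1 for times in by_user.values()
            if sum(t < ts for t in times) >= min_train and sum(t >= ts for t in times) >= min_test
        )
        if count > best_count:
            best_ts, best_count = ts, count
    return best_ts

# Train on interactions strictly before best_ts; evaluate on interactions from best_ts onward.
```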
4.2 Experimental Setup
Two recommendation scenarios are considered to evaluate the effectiveness of the proposed fairness evaluation framework with respect to the item-centric and user-centric notions of fairness [8].

Item fairness evaluation: This scenario applies the proposed fairness evaluation metrics based on GCE to the winner of the ACM RecSys Challenge 2017. The challenge was formulated as "given a job posting, recommend a list of candidates that are suitable candidates for the job". As such, the user candidates are considered as target items for recommendation. In order to compute GCE, we used Eq. (3) considering the simplified case g(u, i, r) = 1, in which the recommendation gain rg_i boils down to the recommendation count rc_i for item i, i.e., the number of times each user appears in the recommendation lists of all jobs. We compare two recommendation approaches, the winner submission and a random submission, and evaluate the systems' fairness from the perspective of the users' membership type, education, and location. As for membership type, premium users (or paid members) are expected to receive a better quality of recommendation.

User fairness evaluation: Here, we experiment with the more traditional item recommendation task, where we study the user fairness dimension. We consider a scenario where a business owner may want to ensure superior recommendation quality for its more engaged users over less engaged (or new) users (or vice versa). In order to get a more intuitive sense of how fairly different recommendation models are recommending to users of different classes, we study the fairness of different CF recommendation models with respect to the users' interactions, defined in 4 categories: (i) very inactive (VIA), (ii) slightly inactive (SIA), (iii) slightly active (SA), and (iv) very active (VA). For each user, we compute the score n_R(u) that corresponds to the total number of ratings provided by user u, and we group the users into four groups according to the quartile that this score belongs to. We have experimented with several recommendation models, such as UserKNN [7], ItemKNN [25] (considering binarized and cosine similarity metrics, the Jaccard coefficient [15], and Pearson correlation [19]), SVD++ [21, 22], BPRMF [22, 24], BPRSlim [23], and two non-personalized models, most-popular and a random recommender.

For comparison with the proposed GCE metric, we include two complementary baseline metrics based on the absolute deviation between the mean ratings of different groups, as defined in [31]:

    MAD(R^{(i)}, R^{(j)}) = \left| \frac{\sum R^{(i)}}{|R^{(i)}|} - \frac{\sum R^{(j)}}{|R^{(j)}|} \right|

where R^{(i)} denotes the predicted ratings for all user-item combinations in group i and |R^{(i)}| is its size. Larger values of MAD mean larger differences between the groups, interpreted as unfairness. Given that our proposed GCE in user-fairness evaluation is based on NDCG, we adapt this definition to also compare the average NDCG of each group. We refer to these two baselines as MAD-rating and MAD-ranking. Finally, the reported MAD corresponds to the average MAD over all the pairwise combinations of the groups involved, i.e., MAD = avg_{i,j}(MAD(R^{(i)}, R^{(j)})).
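Below is a minimal sketch, under our own assumptions about the data layout, of the quartile-based user grouping and of the MAD baseline averaged over all group pairs; group_users_by_activity and average_mad are hypothetical helper names, not taken from the paper.

```python
import numpy as np
from itertools import combinations

def group_users_by_activity(n_ratings_per_user):
    """Assign each user to VIA/SIA/SA/VA according to the quartile of n_R(u)."""
    users = list(n_ratings_per_user)
    counts = np.array([n_ratings_per_user[u] for u in users], dtype=float)
    q1, q2, q3 = np.percentile(counts, [25, 50, 75])
    def label(c):
        return "VIA" if c <= q1 else "SIA" if c <= q2 else "SA" if c <= q3 else "VA"
    return {u: label(c) for u, c in zip(users, counts)}

def average_mad(values_per_group):
    """Average pairwise absolute deviation between per-group means (predicted
    ratings for MAD-rating, per-user NDCG for MAD-ranking)."""
    means = {g: float(np.mean(v)) for g, v in values_per_group.items()}
    pairs = list(combinations(means, 2))
    return sum(abs(means[i] - means[j]) for i, j in pairs) / len(pairs)
```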
4.3 Results and Discussion
We start our analysis with the results for the item fairness evaluation as described in Section 4.2, presented in Tables 3 and 4. The counts in these tables represent the total number of users of a given category that each submission recommends. We observe in Table 3 that the recommendations produced by the RecSys Challenge winner perform better with pf0 than with pf2, since the GCE value is closer to 0. This evidences that the winner system produces balanced recommendations across the two membership classes. This is in contrast to our expectation that premium users should be provided better recommendations. Therefore, even though the winning submission could produce higher recommendation quality from a global perspective, it does not comply with our expectation of a fair recommendation for this attribute, which is to provide better recommendations to premium users.

Table 3: Results of applying the proposed fairness evaluation metrics to the Xing-REC 17 winner submission to identify item-centered fairness for the attribute membership type (regular vs. premium). Note that in this case it is desired to increase the utility of the recommendations for premium (paid) users. pf0 = [1/2, 1/2] (uniform) vs. pf2 = [1/3, 2/3] (non-uniform); GCE is computed with α = 2.

           | Membership type        | GCE(pf, p, α = 2)
           | regular     premium    | pf0      pf2
RSC Winner | 4,108,771   547,029    | 0.2926   0.6786
Random     | 4,209,878   445,759    | 0.3269   0.7335

Furthermore, in Table 4 we present the recommendation fairness evaluation results using GCE across two other attributes, Country and Education; each of these attributes takes 4 categories. We define five variations of the fair distribution pf: while pf0 considers all attribute categories equally important, the others give one attribute category a higher importance compared to the rest. After applying the GCE on the winner submission, we observe that, with respect to the Country attribute, the lowest value of GCE (best case) is produced for the German category (GCE = 0.061), while for the Education attribute the category Unknown (GCE = 0.0974) produces the best outcome; in both cases, these categories are the most frequently recommended by the analyzed submission. These results show that, for a given target application, if the system is looking for candidates with a certain nationality (in this case, German) or education level (here, any), the recommendations coming from the winner submission are closer to a fair system. In fact, due to the inherent biases in the dataset, the random submission obtains better results according to our definition of fairness for several of the fair distributions analyzed. However, it is worth mentioning that if the system designer wants to promote those users with a BSc or PhD, the GCE would show that the winner submission provides better recommendations to those users than the random submission. To the best of our knowledge, this is the first fairness-oriented evaluation metric that allows capturing these nuances, which, as a consequence, helps in understanding how the recommendation algorithms work on each user group.

Table 4: Results of applying the proposed fairness evaluation metrics to the Xing-REC 17 winner submission to identify item-centered fairness for the attributes Country and Education. The associated fair probability distributions are pf0 = [0.25, 0.25, 0.25, 0.25], pf1 = [0.7, 0.1, 0.1, 0.1], pf2 = [0.1, 0.7, 0.1, 0.1], pf3 = [0.1, 0.1, 0.7, 0.1], and pf4 = [0.1, 0.1, 0.1, 0.7], where pf0 defines the fair distribution as uniform while the rest define it as favoring one of the groups. GCE is computed with α = 2.

Country   | German  Austrian  Swiss   Other   | pf0     pf1     pf2    pf3    pf4
winner    | 3.9M    156.1K    329.4K  186.4K  | -0.979  0.061   3.194  3.177  3.192
random    | 3.8M    253.6K    319.6K  239.8K  | -0.883  0.038   2.945  2.937  2.946
Education | NA      BSc       MSc     PhD     | pf0     pf1     pf2    pf3    pf4
winner    | 2.8M    607.2K    1.0M    158.3K  | 0.398   0.0974  1.673  1.547  1.741
random    | 3.0M    428.4K    1.0M    203.4K  | 0.450   0.0887  1.838  1.670  1.866

Table 5 shows the results for the proposed user fairness evaluation as described in Section 4.2. We observe in this table that each recommender obtains a GCE value in a different range, an obvious consequence of the different performance obtained in each case for the different groups (as we observe in the NDCG@10 columns for each user type). For instance, BPRMF is the one found by GCE to perform in a fair way when assuming uniformity with respect to the user groups (pf0); however, if the system designer aims to promote those recommenders that provide better suggestions to the most active users, then Random, followed by ItemKNN and SVD++, are the most fair algorithms.

Table 5: Results of applying the proposed fairness evaluation metrics on the Amazon dataset to identify user-centered fairness. The fair probability distributions are defined as pfi so that pfi(j) = 0.1 when j ≠ i and 0.7 otherwise, except for pf0, which denotes the uniform distribution, i.e., just as in Table 4. Types of users (VIA/SIA/SA/VA) as defined in Section 4.2. Best values per column in bold.

            | NDCG@10                        | GCE(pf, p, α = -1)                      | MAD
            | VIA     SIA     SA      VA     | pf0     pf1     pf2     pf3     pf4     | rating  ranking
Random      | 0.0000  0.0000  0.0000  0.0005 | 1.5000  4.5000  4.5000  4.5000  0.2143  | 0.0000  0.0003
MostPopular | 0.0000  0.0006  0.0013  0.0014 | 0.2435  1.3586  1.2289  0.6714  0.5825  | 0.1864  0.0008
ItemKNN     | 0.0023  0.0021  0.0016  0.0036 | 0.0487  0.6218  0.6722  0.7537  0.2636  | 0.0254  0.0011
UserKNN     | 0.0031  0.0040  0.0037  0.0053 | 0.0214  0.6483  0.5379  0.5783  0.3319  | 0.0375  0.0012
BPRMF       | 0.0022  0.0025  0.0028  0.0016 | 0.0191  0.5496  0.4767  0.3881  0.6642  | 0.2078  0.0006
BPRSlim     | 0.0027  0.0023  0.0035  0.0017 | 0.0353  0.5377  0.6150  0.3267  0.7267  | 9.0009  0.0010
SVD++       | 0.0025  0.0025  0.0025  0.0042 | 0.0324  0.6336  0.6382  0.6361  0.2750  | 0.0027  0.0009

Comparing MAD against GCE, we observe that MAD-ranking produces lower values when the NDCG values of the classes are close to each other (e.g., in the case of the Random recommender), which corresponds to the already discussed notion of fairness as equality/uniformity; similarly, MAD-rating obtains better results for the random algorithm because, as expected, such a method has no inherent bias with respect to the defined user groups, but also for SVD++, probably because this recommender tends to predict ratings in a small range. In both cases, it becomes evident that MAD, in contrast to our proposed GCE metric, cannot incorporate other definitions of fairness in its computation; hence, its flexibility is very limited.

In summary, we have shown that our proposed fairness evaluation metric is able to unveil whether a recommendation algorithm satisfies our definition of fairness, where we argue that it should emphasize a proper distribution of utility based on merits and needs. We demonstrate this for both notions of fairness: based on users and based on items. Therefore, we conclude that this metric could help better explain the results of the algorithms towards specific groups of users and items and, as a consequence, it could increase the transparency of recommender systems evaluation.
5 CONCLUSION
Fairness-aware recommendation research requires appropriate evaluation metrics to quantify fairness. Furthermore, fairness in RS can be associated with either items or users, even though this complementary view has been underrepresented in the literature. In this work, we have presented a probabilistic framework to measure fairness of RS under the perspective of users and items. Experimental results on two real-world datasets show the merits of the proposed evaluation framework. In particular, one of the key aspects of our proposed evaluation metric is its transparency and flexibility, since it allows incorporating domain knowledge (by means of an ideal fair distribution) that helps in understanding which item or user aspects the recommendation algorithms are over- or under-representing.

In the future, we plan to exploit the proposed fairness- and relevance-aware evaluation system to build recommender systems that directly optimize for this objective criterion. It is also of our interest to study various notions of fairness of recommendation under different content-based filtering or CF models using item content as side information [11, 12] and on different domains (e.g., tourism [3], entertainment [14], social recommendation, among others). Finally, we are considering investigating the robustness of CF models against shilling attacks [13] crafted to undermine not only the accuracy of recommendations but also the fairness of these models.

6 ACKNOWLEDGEMENTS
This work was supported in part by the Center for Intelligent Information Retrieval and in part by project TIN2016-80630-P (MINECO). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

REFERENCES
[1] Amazon product data. http://jmcauley.ucsd.edu/data/amazon/. 2014 (accessed June 5, 2019).
[2] Fabian Abel, Yashar Deldjoo, Mehdi Elahi, and Daniel Kohlsdorf. 2017. RecSys Challenge 2017: Offline and Online Evaluation. In Proc. RecSys '17. ACM, New York, NY, USA, 372-373.
[3] Jens Adamczak, Gerard-Paul Leyson, Peter Knees, Yashar Deldjoo, Farshad Bakhshandegan Moghaddam, Julia Neidhardt, Wolfgang Wörndl, and Philipp Monreal. 2019. Session-Based Hotel Recommendations: Challenges and Future Directions. arXiv preprint arXiv:1908.00071 (2019).
[4] Vito Walter Anelli, Tommaso Di Noia, Eugenio Di Sciascio, Azzurra Ragone, and Joseph Trotta. 2019. Local Popularity and Time in top-N Recommendation. In Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14-18, 2019, Proceedings, Part I. 861-868.
[5] Asia J. Biega, Krishna P. Gummadi, and Gerhard Weikum. 2018. Equity of Attention: Amortizing Individual Fairness in Rankings. In Proc. SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018. ACM, 405-414.
[6] Zdravko I. Botev and Dirk P. Kroese. 2011. The generalized cross entropy method, with applications to probability density estimation. Methodology and Computing in Applied Probability 13, 1 (2011), 1-27.
[7] John S. Breese, David Heckerman, and Carl Myers Kadie. 1998. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In UAI '98: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, Wisconsin, USA, July 24-26, 1998. Morgan Kaufmann, 43-52.
[8] Robin Burke. 2017. Multisided fairness for recommendation. arXiv preprint arXiv:1707.00093 (2017).
[9] Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. 2018. Balanced Neighborhoods for Multi-sided Fairness in Recommendation. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA (Proceedings of Machine Learning Research), Vol. 81. PMLR, 202-214.
[10] Pedro G. Campos, Fernando Díez, and Iván Cantador. 2014. Time-aware recommender systems: a comprehensive survey and analysis of existing evaluation protocols. User Model. User-Adapt. Interact. 24, 1-2 (2014), 67-119.
[11] Yashar Deldjoo, Mihai Gabriel Constantin, Hamid Eghbal-Zadeh, Bogdan Ionescu, Markus Schedl, and Paolo Cremonesi. 2018. Audio-visual encoding of multimedia content for enhancing movie recommendations. In Proc. RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018. ACM, 455-459.
[12] Yashar Deldjoo, Maurizio Ferrari Dacrema, Mihai Gabriel Constantin, Hamid Eghbal-zadeh, Stefano Cereda, Markus Schedl, Bogdan Ionescu, and Paolo Cremonesi. 2019. Movie genome: alleviating new item cold start in movie recommendation. User Model. User-Adapt. Interact. 29, 2 (2019), 291-343.
[13] Yashar Deldjoo, Tommaso Di Noia, and Felice Antonio Merra. 2019. Assessing the Impact of a User-Item Collaborative Attack on Class of Users. In Workshop on the Impact of Recommender Systems (ImpactRecSys '19) at the 13th ACM Conference on Recommender Systems.
[14] Yashar Deldjoo, Markus Schedl, Paolo Cremonesi, and Gabriella Pasi. 2018. Content-Based Multimedia Recommendation Systems: Definition and Application Domains. In Proceedings of the 9th Italian Information Retrieval Workshop, Rome, Italy, May 28-30, 2018 (CEUR Workshop Proceedings), Vol. 2140.
[15] Wei Dong, Charikar Moses, and Kai Li. 2011. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web. ACM, 577-586.
[16] Michael D. Ekstrand, Mucun Tian, Ion Madrazo Azpiazu, Jennifer D. Ekstrand, Oghenemaro Anuyah, David McNeill, and Maria Soledad Pera. 2018. All The Cool Kids, How Do They Fit In?: Popularity and Demographic Biases in Recommender Evaluation and Effectiveness. In Conference on Fairness, Accountability and Transparency. 172-186.
[17] Golnoosh Farnadi, Pigi Kouki, Spencer K. Thompson, Sriram Srinivasan, and Lise Getoor. 2018. A Fairness-aware Hybrid Recommender System. CoRR abs/1809.09030 (2018). arXiv:1809.09030
[18] Asela Gunawardana and Guy Shani. 2015. Evaluating Recommender Systems. In Recommender Systems Handbook. Springer, 265-308.
[19] Jonathan L. Herlocker, Joseph A. Konstan, and John Riedl. 2002. An Empirical Analysis of Design Choices in Neighborhood-Based Collaborative Filtering Algorithms. Inf. Retr. 5, 4 (2002), 287-310.
[20] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422-446.
[21] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008. ACM, 426-434.
[22] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30-37.
[23] Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011. IEEE Computer Society, 497-506.
[24] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009. AUAI Press, 452-461.
[25] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2000. Analysis of recommendation algorithms for e-commerce. In Proceedings of the 2nd ACM Conference on Electronic Commerce. ACM, 158-167.
[26] Ashudeep Singh and Thorsten Joachims. 2018. Fairness of Exposure in Rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018. ACM, 2219-2228.
[27] Till Speicher, Hoda Heidari, Nina Grgic-Hlaca, Krishna P. Gummadi, Adish Singla, Adrian Weller, and Muhammad Bilal Zafar. 2018. A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual & Group Unfairness via Inequality Indices. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2239-2248.
[28] Sirui Yao and Bert Huang. 2017. Beyond parity: Fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems. 2921-2930.
[29] Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo A. Baeza-Yates. 2017. FA*IR: A Fair Top-k Ranking Algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06-10, 2017. ACM, 1569-1578.
[30] Yong Zheng, Tanaya Dave, Neha Mishra, and Harshit Kumar. 2018. Fairness In Reciprocal Recommendations: A Speed-Dating Study. In Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization, UMAP 2018, Singapore, July 08-11, 2018. ACM, 29-34.
[31] Ziwei Zhu, Xia Hu, and James Caverlee. 2018. Fairness-Aware Tensor-Based Recommendation. In Proc. CIKM 2018, Torino, Italy, October 22-26, 2018. ACM, 1153-1162.