<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.15439/2021F117</article-id>
      <title-group>
        <article-title>Consumer Fairness Benchmark in Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ludovico Boratto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianni Fenu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirko Marras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giacomo Medda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Cagliari</institution>
          ,
          <addr-line>Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>20</volume>
      <issue>2019</issue>
      <fpage>2</fpage>
      <lpage>5</lpage>
      <abstract>
        <p>Several mitigation procedures have emerged to address consumer unfairness in personalized rankings. However, evaluating their performance is difficult due to variations in experimental protocols, such as differing fairness definitions, data sets, evaluation metrics, and sensitive attributes. This makes it challenging for scientists to choose a suitable procedure for their practical setting. In this paper, we summarize our previous work on investigating the properties a given mitigation procedure against consumer unfairness should be evaluated on. To this end, we defined eight technical properties and leveraged two public datasets to evaluate the extent to which existing mitigation procedures against consumer unfairness met these properties. Source code and data: https://github.com/jackmedda/Perspective-C-Fairness-RecSys.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>Consumer Fairness</kwd>
        <kwd>Mitigation Procedure</kwd>
        <kwd>Reproducibility</kwd>
        <kwd>Evaluation Protocol</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the large adoption of decision-support systems, governments are establishing regulations
to account for their trustworthiness. Indeed, it is fundamental to highlight and administer the
harmful impacts of artificial intelligence (AI) systems. Recommender systems denote a notable
example of systems where trustworthiness and safety are key aspects to be concerned about. In
such systems, people are provided with personalized suggestions generated by a certain model
[
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Prior studies have, however, shown that recommender systems often lead to discriminatory
outcomes [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ], affecting the entity being ranked or the users the recommendations are
targeted to (consumers) [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. Despite the growing interest in providing fair recommendations
to consumers, diverging definitions of consumer fairness have led to unfairness mitigation
procedures built on top of heterogeneous evaluation protocols. It is then crucial to discuss
which properties a mitigation procedure against consumer unfairness should be evaluated on.
      </p>
      <p>
        In this paper, we summarize our prior work [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] on building a common ground that can act as
a basis for the evaluation of consumer unfairness mitigation procedures. To this end, we defined
eight technical properties a given mitigation procedure against consumer unfairness should meet
to be effective in practice. We then benchmarked the extent to which existing mitigation
procedures meet the defined properties, qualitatively and quantitatively (when possible), on two
public data sets. Finally, we gathered the evaluation performance of the mitigation procedures
under each property and highlighted the extent to which each procedure meets these properties.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Perception of the State of the Art</title>
      <p>
        In this section, we describe the process followed for collecting papers about consumer
unfairness mitigation procedures and then categorizing them based on how unfairness was defined,
mitigated, and assessed (Table 1). Please refer to the original work [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for detailed information.
Paper Collection Process. Mitigation procedures against consumer unfairness proposed so far
in the literature were collected by scanning the proceedings of Information Retrieval conferences
and workshops, as well as high-impact journals. Relying on the framework shared by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] when
possible, we reproduced the mitigation procedures proposed in the collected papers.
Fairness Definition Perception. There is no consensus on how to perceive unfairness from
a consumer perspective in recommendation. Studies often consider different viewpoints to
analyze, mitigate, and evaluate unfairness. Generally, these studies explored fairness notions
that mainly address two principles: equity of certain metric scores between demographic groups
(EQ); independence of a certain outcome from the sensitive attribute (IND).
      </p>
      <p>
        Unfairness Mitigation Perception. Studies focusing on fairness from an EQ perspective
usually perform a mitigation by balancing the representation of groups in the training set (e.g.,
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), reducing the error across groups (e.g., [
        <xref ref-type="bibr" rid="ref6">6, 16</xref>
        ]) or re-ranking items (e.g., [
        <xref ref-type="bibr" rid="ref12">12, 14</xref>
        ]). From
an IND perspective, unfairness is usually countered by decoupling the user and item latent
representations from sensitive attribute information (e.g., [15, 17]) or introducing independence
guarantees between the sensitive attribute and the predicted relevance score (e.g., [13]).
      </p>
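To make the IND principle concrete, the following is a minimal, hypothetical sketch (not taken from any of the surveyed papers) of how dependence between predicted relevance scores and a binary sensitive attribute could be quantified; the function name `independence_penalty` and the correlation-based criterion are illustrative assumptions, not the method of [13] or [15, 17].

```python
import numpy as np

def independence_penalty(scores, sensitive):
    """Toy IND-style measure: absolute Pearson correlation between predicted
    relevance scores and a binary sensitive attribute. A value near 0 means
    the scores carry no linear signal about group membership."""
    scores = np.asarray(scores, dtype=float)
    sensitive = np.asarray(sensitive, dtype=float)
    if scores.std() == 0 or sensitive.std() == 0:
        return 0.0  # constant input: no dependence to measure
    return float(abs(np.corrcoef(scores, sensitive)[0, 1]))

# Identical score distributions across the two groups -> no measured dependence.
print(round(independence_penalty([0.9, 0.1, 0.9, 0.1], [0, 0, 1, 1]), 12))  # 0.0
```

A penalty of this kind could be minimized alongside the recommendation loss; actual IND procedures use stronger (distribution-level) criteria than linear correlation.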
    </sec>
    <sec id="sec-3">
      <title>3. Research Methodology</title>
      <p>
        In this section, we describe the data sets, sensitive attributes, recommendation models, and the
unified evaluation protocol [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] used to benchmark the collected mitigation procedures.
Experimental Data Sets. We selected the two public data sets reported in Table 2, namely
ML1M (movies) and LFM1K (music). We considered binarized protected attribute labels (if not
already binary, the groups were binarized to have the most similar representation possible).
Recommendation Models. The range of recommendation models evaluated in prior work
in terms of consumer unfairness was heterogeneous, since no common protocol existed. Our
study in this paper focuses on recommendation models considered in at least one prior work.
Evaluation Protocol. For each setup, we obtained the predicted relevance scores and monitored
the utility of top-n recommendations through NDCG. Unfairness between consumer groups was
monitored from an equity (EQ) perspective in terms of NDCG Demographic Parity (DP) [19],
computed as the difference in NDCG between the majority group and the minority group (w.r.t.
their representation in the data set), and from an independence (IND) perspective by means of a
Kolmogorov-Smirnov test (KS) on the predicted relevance scores, as also proposed by [20].
      </p>
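The evaluation protocol above can be sketched in a few lines. The following is an illustrative implementation under the assumption of binary relevance and binary groups; the function names (`ndcg_at_k`, `ndcg_dp`, `ks_statistic`) are ours, not those of the benchmarked framework.

```python
import numpy as np

def ndcg_at_k(ranked_rels, k):
    """NDCG@k with binary relevance for a single user's ranked list."""
    rels = np.asarray(ranked_rels[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rels.size + 2))
    dcg = float(np.sum(rels * discounts))
    ideal = np.sort(np.asarray(ranked_rels, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts[: ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

def ndcg_dp(user_ndcg, groups, majority_label):
    """EQ perspective: NDCG Demographic Parity, i.e., the difference in
    average NDCG between the majority and the minority group."""
    user_ndcg, groups = np.asarray(user_ndcg), np.asarray(groups)
    majority = user_ndcg[groups == majority_label].mean()
    minority = user_ndcg[groups != majority_label].mean()
    return float(majority - minority)

def ks_statistic(scores_a, scores_b):
    """IND perspective: two-sample Kolmogorov-Smirnov statistic, i.e., the
    maximum vertical gap between the groups' empirical CDFs of scores."""
    a, b = np.sort(scores_a), np.sort(scores_b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

A DP value near zero indicates equity between groups, while a KS statistic near zero indicates that the score distributions of the two groups are indistinguishable.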
    </sec>
    <sec id="sec-4">
      <title>4. Mitigation Procedures Benchmark</title>
      <p>
        In this section, we propose eight key properties to consider while evaluating a mitigation
procedure offline, before moving it into practice. Table 3 reports the performance of the
recommender systems, before and after unfairness was mitigated, in terms of recommendation
utility and fairness between gender groups. Other results can be found in our original study [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Applicability. Indicates the extent to which a mitigation procedure can be technically run on
a wide range of different recommendation models without requiring any substantial change to
the fundamental steps it is based on. Pre-processing approaches potentially have a very high
applicability, while the applicability of in-processing and post-processing approaches could be
affected by aspects related to the implementation or to the adopted fairness notion.
Coherence. Indicates the extent to which a mitigation procedure tends to reduce the biased
outcomes for the originally disadvantaged group, without reversing the disparate outcome towards
the other group(s). In Table 3, low coherence was reported by SLIM-U since applying the mitigation
of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] led to male users being advantaged instead of female users.
      </p>
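The coherence property above lends itself to a simple operational check. Below is a hypothetical criterion (our own illustration, not the paper's exact test) comparing the NDCG gap (DP) before and after mitigation:

```python
def is_coherent(dp_before, dp_after):
    """Hypothetical check for the coherence property: the mitigation must
    shrink the NDCG gap (DP) without flipping its sign, i.e., without
    turning the originally disadvantaged group into the advantaged one."""
    same_direction = dp_before * dp_after >= 0
    return same_direction and abs(dp_after) <= abs(dp_before)

print(is_coherent(0.05, 0.01))   # True: gap reduced, same group still ahead
print(is_coherent(0.05, -0.03))  # False: advantage reversed to the other group
```

The second call mirrors the SLIM-U case in Table 3, where the mitigation left male users advantaged instead of female users.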
      <p>
        Consistency. Indicates the ability of a mitigation procedure to substantially reduce the model’s
unfairness according to the pursued fairness notion, given any data set and any consumer grouping
method. Overall, Li et al. [14] was the only consistent mitigation procedure across data sets and
sensitive attributes under our unified evaluation protocol. Instead, under the papers’ original
evaluation protocols, no procedure was consistent according to our definition.
Data Robustness. Indicates the ability of a mitigation procedure to reduce unfairness also in
challenging cases related to data distribution (e.g., imbalances) and relationships between unfairness
and other features. Our analysis uncovered that leveraging data characteristics causally related
to unfairness, e.g., popularity bias [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], to reduce it could provide better insights into the problem.
Reproducibility. Indicates the ability to take the original source code that implements a
mitigation procedure and execute it under the same or a different evaluation protocol with
respect to the one used in the original paper. Our analysis showed that 2 out of 8 papers were
not reproducible, which limited our work and underscores the need to share source code.
Scalability. Indicates the ability of a mitigation procedure to scale well as the number of
interactions, users, items, sensitive attributes, and other relevant features increases substantially. On
data sets with a higher number of entities (e.g., users, interactions), some mitigation procedures
(Li et al. and Burke et al. [
        <xref ref-type="bibr" rid="ref6">6, 14</xref>
        ]) would lead to unmanageable time and memory requirements.
Trade-off Management. Indicates the ability of a mitigation procedure to preserve the
performance estimate originally achieved by the target recommendation model (before the mitigation
was applied). Overall, Ekstrand et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] reported the best trade-off across all the data sets and
sensitive attributes. It reduced unfairness while minimally affecting utility.
      </p>
      <p>
        Transferability. Indicates the ability of a mitigation procedure to be effective (and not only
applicable) on a wide range of recommendation models, even those it was not originally designed
for or tested on. We applied the mitigation procedures of Ekstrand et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Li et al. [14]
on the models used by the other papers. Neither method showed a good level of transferability.
Discussion. As a summary, for each property and mitigation procedure, we assigned one of
two labels: Higher when the corresponding work was better than the others on average for the
selected property, Lower otherwise. The mitigation procedures proposed by
[
        <xref ref-type="bibr" rid="ref11">14, 11</xref>
        ] reported the highest number of above-average properties.
      </p>
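The labeling rule above can be expressed compactly. The sketch below is purely illustrative: the procedure names, property names, and numeric scores are hypothetical, and it assumes every procedure is scored on every property.

```python
def label_procedures(scores):
    """Assign 'Higher' to a procedure on a property when its score beats the
    average score of all procedures on that property, else 'Lower'.
    `scores` maps procedure name -> {property: score}."""
    props = {p for s in scores.values() for p in s}
    avg = {p: sum(s[p] for s in scores.values()) / len(scores) for p in props}
    return {proc: {p: "Higher" if s[p] > avg[p] else "Lower" for p in s}
            for proc, s in scores.items()}

# Hypothetical property scores for two procedures "A" and "B".
labels = label_procedures({
    "A": {"coherence": 0.9, "scalability": 0.2},
    "B": {"coherence": 0.4, "scalability": 0.8},
})
print(labels["A"])  # {'coherence': 'Higher', 'scalability': 'Lower'}
```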
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>In this work, we collected and reproduced relevant papers addressing consumer unfairness
mitigation and categorized them according to the definition, mitigation, and assessment
strategy. Then, we defined a unified experimental protocol, including eight technical properties a
mitigation procedure should meet, and evaluated the reproduced mitigation procedures on two
public data sets on the basis of the defined evaluation properties. Our work enables a
better understanding of the aspects that could increase mitigation effectiveness and of what
can be done to avoid the phenomena outlined by our experiments. Future work will consider
novel mitigation procedures able to satisfy all the properties introduced in our paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Armentano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monteserin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Berdun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bongiorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Coussirat</surname>
          </string-name>
          ,
          <article-title>User recommendation in low degree networks with a learning-based approach</article-title>
          ,
          <source>in: Mexican International Conference on Artificial Intelligence</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>286</fpage>
          -
          <lpage>298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mauro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ardissono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cocomazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cena</surname>
          </string-name>
          ,
          <article-title>Using consumer feedback from location-based services in POI recommender systems for people with autism</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>199</volume>
          (
          <year>2022</year>
          )
          <fpage>116972</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Dinnissen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <article-title>Fairness in music recommender systems: A stakeholder-centered mini review</article-title>
          ,
          <source>Frontiers in Big Data</source>
          (????) 63.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Lesota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Melchiorre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rekabsaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kowald</surname>
          </string-name>
          , E. Lex,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <article-title>Analyzing item popularity bias of music recommender systems: Are different genders equally affected?</article-title>
          ,
          <source>in: Fifteenth ACM Conference on Recommender Systems</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>601</fpage>
          -
          <lpage>606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Neidhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sertkan</surname>
          </string-name>
          ,
          <article-title>Towards an approach for analyzing dynamic aspects of bias and beyond-accuracy measures</article-title>
          ,
          <source>in: International Workshop on Algorithmic Bias in Search and Recommendation</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Burke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sonboli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ordonez-Gauger</surname>
          </string-name>
          ,
          <article-title>Balanced neighborhoods for multi-sided fairness in recommendation</article-title>
          , in: Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA, volume 81 of Proceedings of Machine Learning Research, PMLR,
          <year>2018</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>214</lpage>
          . URL: http://proceedings.mlr.press/v81/burke18a.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Bias and debias in recommender system: A survey and future directions</article-title>
          , CoRR abs/2010.03240 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2010.03240. arXiv:2010.03240.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abdollahpouri</surname>
          </string-name>
          , G. Adomavicius,
          <string-name>
            <given-names>R.</given-names>
            <surname>Burke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Guy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kamishima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krasnodebski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Pizzato</surname>
          </string-name>
          ,
          <article-title>Multistakeholder recommendation: Survey and research directions</article-title>
          ,
          <source>User Model. User Adapt. Interact.</source>
          <volume>30</volume>
          (
          <year>2020</year>
          )
          <fpage>127</fpage>
          -
          <lpage>158</lpage>
          . URL: https://doi.org/10.1007/s11257-019-09256-1. doi:10.1007/s11257-019-09256-1.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Boratto</surname>
          </string-name>
          , G. Fenu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marras</surname>
          </string-name>
          , G. Medda,
          <article-title>Practical perspectives of consumer fairness in recommendation</article-title>
          ,
          <source>Inf. Process. Manag</source>
          .
          <volume>60</volume>
          (
          <year>2023</year>
          )
          <fpage>103208</fpage>
          . URL: https://doi.org/10.1016/j.ipm.2022.103208. doi:10.1016/j.ipm.2022.103208.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Boratto</surname>
          </string-name>
          , G. Fenu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marras</surname>
          </string-name>
          , G. Medda,
          <article-title>Consumer fairness in recommender systems: Contextualizing definitions and mitigations</article-title>
          , in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>552</fpage>
          -
          <lpage>566</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          M. D. Ekstrand, M. Tian, I. M. Azpiazu, J. D. Ekstrand, O. Anuyah, D. McNeill, M. S. Pera,
          <article-title>All the cool kids, how do they fit in?: Popularity and demographic biases in recommender evaluation and effectiveness</article-title>
          , in: Conference on Fairness, Accountability and Transparency, FAT 2018, volume 81, PMLR, 2018
          , pp.
          <fpage>172</fpage>
          -
          <lpage>186</lpage>
          . URL: http://proceedings.mlr.press/v81/ekstrand18b.html.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Tsintzou</surname>
          </string-name>
          , E. Pitoura,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tsaparas</surname>
          </string-name>
          ,
          <article-title>Bias disparity in recommendation systems</article-title>
          , in: R. Burke,
          <string-name>
            <given-names>H.</given-names>
            <surname>Abdollahpouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Malthouse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Thai</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Proceedings of the Workshop on Recommendation in Multi-stakeholder Environments co-located with the 13th ACM Conference on Recommender Systems (RecSys</source>
          <year>2019</year>
          ), Copenhagen, Denmark,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>