
1613-0073

Metrics⋆

Discussion Paper

Vincenzo Paparella

vincenzo.paparella@poliba.it

Dario Di Palma

d.dipalma2@phd.poliba.it

Vito Walter Anelli

Alessandro De Bellis

Tommaso Di Noia

Recommender System, Multi-Objective Evaluation, Pareto optimality

Politecnico di Bari, via Orabona, 4, 70125 Bari, Italy

Current recommender systems (RSs) prioritize accuracy, often neglecting aspects like diversity and fairness. This single-metric approach overlooks valuable trade-offs between different qualities. We propose a multi-objective evaluation using Pareto optimality and Quality Indicators (QIs) of Pareto frontiers to consider all model configurations simultaneously across multiple perspectives. This approach reveals a more comprehensive picture of RS performance, potentially leading to a reevaluation of existing methods. Code and data are available at https://github.com/sisinflab/RecMOE.

CEUR ceur-ws.org

1. Introduction

The success of Recommender Systems (RSs) is often measured by their ability to accurately predict a user’s preferences and suggest relevant items. However, beyond-accuracy metrics such as diversity [2], novelty [3, 4], and fairness [5, 6] have also been proposed. While beyond-accuracy metrics have gained momentum, accuracy is still prioritized [7, 8, 9]. Figure 1 shows the normalized performance of baselines on the Goodreads dataset when the best hyper-parameters are selected for each metric. Selecting the best model solely on the basis of accuracy limits the consideration of beyond-accuracy performance. A configuration is Pareto-optimal if no other configuration improves one objective without worsening at least one other; the set of Pareto-optimal configurations forms the Pareto frontier [10, 11]. We propose introducing Quality Indicators (QIs) [12] to RSs, providing a quantitative evaluation of Pareto frontiers from different perspectives [13]. Our contributions are: (i) showing the negative impact of prioritizing accuracy and motivating multi-objective evaluation; (ii) computing Pareto frontiers for the hyper-parameter settings of models on public datasets in multi-objective scenarios; and (iii) enhancing multi-objective evaluation by using QIs to comprehensively analyze recommendation models.

2. Quality Indicators

In this section, we present the Quality Indicators (QIs) used to assess the Pareto frontier of an RS model. They can be classified according to the quality aspect they assess.

⋆Extended version [1] published at the 17th ACM Conference on Recommender Systems (RecSys 2023).

[Figure 1: panels (a) UserKNN, (b) RP3, (c) EASE; recoverable axis labels: Div, Bias, Nov. Legend: models chosen for the best values of Accuracy/Novelty, Diversity, Bias.]

Spread QI. Spread indicators assess the extent of the region covered by a Pareto frontier. For our study, we use the Maximum Spread (MS) [14]. This indicator measures the range of a Pareto frontier by considering the maximum extent of each objective. The higher the value, the more extensive the frontier.
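As an illustration of the spread indicator, the following minimal sketch computes MS in the form commonly attributed to [14], i.e., the square root of the summed squared per-objective extents. The exact formula is an assumption here, since the text above only describes it informally; the function name `maximum_spread` and the sample points are hypothetical.

```python
import math


def maximum_spread(frontier):
    """Maximum Spread of a frontier given as equal-length tuples of objectives.

    Assumed form: sqrt(sum_j (max_j - min_j)^2), where max_j and min_j are the
    extreme values of objective j over the frontier.
    """
    if not frontier:
        raise ValueError("frontier must contain at least one point")
    n_objectives = len(frontier[0])
    total = 0.0
    for j in range(n_objectives):
        values = [point[j] for point in frontier]
        extent = max(values) - min(values)  # range covered on objective j
        total += extent ** 2
    return math.sqrt(total)


# Hypothetical two-objective frontier (values are made up for illustration).
example_frontier = [(0.0, 4.0), (3.0, 0.0), (1.0, 2.0)]
ms = maximum_spread(example_frontier)
print(ms)  # 5.0
```

Because MS depends on the scale of each objective, objectives with different ranges would normally be normalized before comparing frontiers; whether and how this is done is not specified in the text above.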

Uniformity QI. The uniformity of a Pareto frontier describes how the solutions are distributed along it. A higher uniformity of the curve denotes that the solutions are less dispersed, while a low uniformity indicates more diversity within the set. Specifically, we employ the Spacing metric (SP) [15], which measures the variation in the Manhattan distances between the Pareto-optimal solutions. The lower the value, the more concentrated the solutions are on the Pareto frontier. However, $SP = 0$ indicates that all the solutions are equidistant.
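The Spacing metric can be sketched as follows. This is a minimal illustration that assumes the usual formulation from [15]: for each solution, take the Manhattan distance to its nearest neighbour on the frontier, then compute the sample standard deviation of those distances. The function name `spacing` and the sample frontiers are hypothetical.

```python
import math


def spacing(frontier):
    """Spacing (SP) of a frontier given as equal-length tuples of objectives.

    Assumed form: sqrt(1 / (n - 1) * sum_i (d_mean - d_i)^2), where d_i is the
    Manhattan distance from solution i to its nearest neighbour.
    """
    n = len(frontier)
    if n < 2:
        raise ValueError("spacing needs at least two solutions")
    nearest = []
    for i, p in enumerate(frontier):
        distances = [
            sum(abs(a - b) for a, b in zip(p, q))
            for k, q in enumerate(frontier)
            if k != i
        ]
        nearest.append(min(distances))
    mean = sum(nearest) / n
    return math.sqrt(sum((mean - d) ** 2 for d in nearest) / (n - 1))


# Evenly spaced solutions: every nearest-neighbour distance is identical.
even = [(0.0, 3.0), (1.0, 2.0), (2.0, 1.0), (3.0, 0.0)]
# Unevenly spaced solutions: one solution sits far from the others.
uneven = [(0.0, 3.0), (1.0, 2.0), (3.0, 0.0)]
sp_even = spacing(even)
sp_uneven = spacing(uneven)
print(sp_even, sp_uneven)
```

In this example the evenly spaced frontier yields SP equal to zero, while the uneven one yields a strictly positive value.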

Cardinality QI. Given generic solutions $s$ belonging to the set $S$, the QIs for cardinality determine the proportion of Pareto-optimal solutions in this set. Specifically, the Error Ratio (ER) [16] is defined as $ER(S) = \frac{1}{|S|} \sum_{s \in S} e(s)$, with $e(s) = 1$ if $s$ is a Pareto-optimal solution and $0$ otherwise. A higher ER value indicates a greater share of Pareto-optimal solutions in the set $S$.

All quality aspects QI. The QIs included in this category provide insights into the spread, uniformity, and cardinality of the Pareto frontiers simultaneously. Among them, the Hypervolume (HV) [17] is a volume-based QI that measures the volume of the objective function space dominated by the Pareto frontier. The larger the hypervolume, the better the solution set.
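To make the cardinality indicator concrete, the sketch below filters a set of configurations down to its Pareto-optimal subset and computes ER as the share of Pareto-optimal solutions, following the definition given above. It assumes every objective is to be maximized (a metric to be minimized would first be negated). The function names and the sample (accuracy, bias-related) values are hypothetical and not taken from the experiments.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one.

    All objectives are assumed to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


def error_ratio(solutions):
    """Share of Pareto-optimal solutions in the set (higher is better here)."""
    if not solutions:
        raise ValueError("solutions must not be empty")
    return len(pareto_front(solutions)) / len(solutions)


# Hypothetical configurations of one model, as (objective 1, objective 2) pairs.
configs = [(0.30, 0.10), (0.25, 0.20), (0.20, 0.15), (0.10, 0.30), (0.28, 0.05)]
front = pareto_front(configs)
er = error_ratio(configs)
print(front)  # [(0.3, 0.1), (0.25, 0.2), (0.1, 0.3)]
print(er)     # 0.6
```

The quadratic pairwise comparison is adequate for a few dozen configurations per model; larger sets would call for a sort-based filter.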

3. Experiments

We aim to answer two research questions:

RQ1: To what extent can the models provide Pareto-optimal configurations? Are these configurations uniformly distributed, or are they dispersed, offering diverse solutions to the trade-off?

RQ2: Which model has the Pareto frontier that simultaneously offers better solutions on multiple metrics?

Datasets. We select three different datasets to cover several domains. Specifically, we use Amazon Music (music), Goodreads [18] (books), and Movielens1M [19] (movies).

Baselines and Hyper-parameter Settings Exploration. We train five recommendation algorithms, i.e., EASE [20], MultiVAE [21], LightGCN [22], RP3 [23], and UserKNN [24]. For each model, we explore 32 hyper-parameter configurations using Elliot [25].

Metrics. We assess the baselines’ performance from several perspectives. We compute nDCG, Precision, and Recall for the accuracy of recommendations. From the final user's point of view, we evaluate diversity (with the Gini index [26] and Item Coverage) and novelty (with EPC and EFD [3]). Finally, we measure the popularity bias of the recommendations with APLT [27] (the greater, the better) and ARP [26] (the lower, the better). All metrics are computed at cutoff 10.

Multi-Objective Evaluation Methodology. We obtain Pareto frontiers for each recommender system (RS) baseline using the metrics described above. Each hyper-parameter setting represents a solution in the objective space. We identify the Pareto-optimal configurations for each baseline, forming their respective Pareto frontiers. We evaluate these frontiers using QIs under two scenarios: 1) user-centered (accuracy, diversity, novelty) and 2) accuracy vs. algorithmic bias. Figure 2 shows the resulting Pareto frontiers.

3.1. Results and Discussion

While EASE and UserKNN provide the most accurate recommendations, beyond-accuracy metrics paint a different picture. Figure 2 shows that UserKNN exhibits better diversity than EASE. Finally, RP3 consistently outperforms its competitors in addressing the popularity bias. We delve into a multi-objective evaluation using QIs on Pareto frontiers.
Here, we examine the distribution of Pareto-optimal configurations and the performance on all quality metrics.

Distribution of Pareto-optimal configurations. The Error Ratio (ER), Maximum Spread (MS), and Spacing metric (SP) values in Table 1 unveil interesting insights into the distribution of Pareto-optimal configurations for each model. In the nDCG/APLT scenario for the Movielens1M dataset, for instance: 1) UserKNN exhibits a wide range of solutions with good dispersion across the Pareto frontier, indicating its ability to offer various well-balanced trade-offs between accuracy and algorithmic bias; 2) EASE offers a high number of solutions on the frontier, but they tend to be concentrated in a limited area, suggesting a lack of diversity in the achievable trade-offs; 3) RP3 strikes a good balance between the number of solutions, their dispersion, and the ability to provide various trade-offs between accuracy and bias. This is reflected in its high ER, MS, and SP values. Similar trends are observed for the other datasets (see Figures 2d and 2e). When examining the user-centered scenario (nDCG/Gini/EPC), UserKNN again excels, offering well-diversified solutions across all datasets (see Figures 2a - 2c).

[Figure 2: Pareto frontiers. Panels: (a) Amazon, nDCG/Gini/EPC; (b) Goodreads, nDCG/Gini/EPC; (c) ML1M, nDCG/Gini/EPC; (d) Amazon, nDCG/APLT; (e) Goodreads, nDCG/APLT; (f) ML1M, nDCG/APLT. Legend: RP3, EASE, UserKNN, LightGCN, MultiVAE.]

Performance on all quality metrics. To answer RQ2, we use the Hypervolume (HV) measure. HV evaluates the performance of models on multiple objectives simultaneously, as shown in Table 1. Because it accounts for the cardinality and dispersion of the Pareto-optimal solutions as well as the dominance among the Pareto frontiers, HV provides valuable insights. The higher the volume or area under the frontier, the greater the HV. The results show that UserKNN outperforms the other models by achieving the best or second-best values of HV for all datasets and scenarios. This result indicates that UserKNN generates an extensive and diversified Pareto frontier while performing well across all metrics. While EASE has the highest value of HV for the Amazon Music dataset in the user-centered scenario, it neither dominates nor is dominated in the remaining cases. This result highlights the model’s limited reliability when multiple metrics are considered. LightGCN shows no distinctive trends, while MultiVAE’s HV decreases when dealing with sparser datasets. RP3 confirms its capability in managing the nDCG/APLT trade-off by achieving the highest values of HV and visual dominance of its Pareto frontiers over the others in Figures 2d, 2e, and 2f.
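For the two-objective scenario, the area interpretation of HV can be sketched with a simple sweep. This is a minimal illustration, not the procedure used to produce Table 1: it assumes both objectives are maximized and uses the origin as the reference point, and the function name `hypervolume_2d` and the sample points are hypothetical. The three-objective scenario would need a general hypervolume algorithm instead.

```python
def hypervolume_2d(frontier, reference=(0.0, 0.0)):
    """Area dominated by a two-objective frontier relative to a reference point.

    Both objectives are assumed to be maximized. Points that do not strictly
    improve on the reference point in both objectives contribute nothing.
    """
    ref_x, ref_y = reference
    points = sorted(
        (p for p in frontier if p[0] > ref_x and p[1] > ref_y),
        key=lambda p: p[0],
        reverse=True,
    )
    area = 0.0
    best_y = ref_y  # highest second objective seen so far in the sweep
    for x, y in points:
        if y > best_y:
            # New horizontal slab between best_y and y, spanning ref_x to x.
            area += (x - ref_x) * (y - best_y)
            best_y = y
    return area


# Hypothetical frontier plus one dominated point, which leaves the area unchanged.
points = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0)]
hv = hypervolume_2d(points)
print(hv)  # 6.0
```

Because dominated points add no area, HV rewards frontiers that are both extensive and close to the ideal corner, which matches its role above as an indicator covering all quality aspects at once.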

4. Conclusion and Future Work

Our multi-objective evaluation with Quality Indicators reveals new insights into recommender systems (RSs). While EASE exhibits high accuracy, UserKNN emerges as a strong contender offering diverse solutions across multiple objectives. Additionally, RP3 proved to be highly effective in the accuracy/algorithmic bias scenario.

Acknowledgements. The authors acknowledge partial support of the following projects: OVS: Fashion Retail Reloaded, Lutech Digitale 4.0, Secure Safe Apulia, Patti Territoriali WP1, BIO-D, and MOST - Centro Nazionale per la Mobilità Sostenibile. We also gratefully acknowledge the CINECA award under the ISCRA initiative, for the availability of HPC resources and support.

[1] Paparella, D. Di Palma, V. W. Anelli, T. D. Noia, Broadening the scope: Evaluating the potential of recommender systems beyond prioritizing accuracy, in: J. Zhang, L. Chen, S. Berkovsky, M. Zhang, T. D. Noia, J. Basilico, L. Pizzato, Y. Song (Eds.), Proceedings of the 17th ACM Conference on Recommender Systems, RecSys 2023, Singapore, Singapore, September 18-22, 2023, ACM, 2023, pp. 1139-1145. URL: https://doi.org/10.1145/3604915.3610649. doi:10.1145/3604915.3610649.

[2] Paparella, V. W. Anelli, Boratto, T. D. Noia, Reproducibility of multi-objective reinforcement learning recommendation: Interplay between effectiveness and beyond-accuracy perspectives, in: J. Zhang, L. Chen, S. Berkovsky, M. Zhang, T. D. Noia, J. Basilico, L. Pizzato, Y. Song (Eds.), Proceedings of the 17th ACM Conference on Recommender Systems, RecSys 2023, Singapore, Singapore, September 18-22, 2023, ACM, 2023, pp. 467-478. URL: https://doi.org/10.1145/3604915.3609493. doi:10.1145/3604915.3609493.

[3] Vargas, Castells, Rank and relevance in novelty and diversity metrics for recommender systems, in: B. Mobasher, R. D. Burke, D. Jannach, G. Adomavicius (Eds.), Proceedings of the 2011 ACM Conference on Recommender Systems, RecSys 2011, Chicago, IL, USA, October 23-27, 2011, ACM, 2011, pp. 109-116. URL: https://dl.acm.org/citation.cfm?id=2043955.

[4] Di Palma, Retrieval-augmented recommender system: Enhancing recommender systems with large language models, in: RecSys, ACM, 2023, pp. 1369-1373.

[5] Boratto, G. Fenu, Marras, Interplay between upsampling and regularization for provider fairness in recommender systems, User Model. User Adapt. Interact. 31 (2021) 421-455. URL: https://doi.org/10.1007/s11257-021-09294-8. doi:10.1007/s11257-021-09294-8.

[6] Di Palma, V. W. Anelli, Malitesta, Paparella, Pomo, Deldjoo, T. D. Noia, Examining fairness in graph-based collaborative filtering: A consumer and producer perspective, in: IIR, volume 3448 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 79-84.

[7] V. W. Anelli, T. D. Noia, E. D. Sciascio, Pomo, Ragone, On the discriminative power of hyper-parameters in cross-validation and how to choose them, in: T. Bogers, A. Said, P. Brusilovsky, D. Tikk (Eds.), Proceedings of the 13th ACM Conference on Recommender Systems, RecSys 2019, Copenhagen, Denmark, September 16-20, 2019, ACM, 2019, pp. 447-451. URL: https://doi.org/10.1145/3298689.3347010. doi:10.1145/3298689.3347010.

[8] V. W. Anelli, Bellogín, T. D. Noia, C. Pomo, Reenvisioning the comparison between neural collaborative filtering and matrix factorization, in: H. J. C. Pampín, M. A. Larson, M. C. Willemsen, J. A. Konstan, J. J. McAuley, J. Garcia-Gathright, B. Huurnink, E. Oldridge (Eds.), RecSys '21: Fifteenth ACM Conference on Recommender Systems, Amsterdam, The Netherlands, 27 September 2021 - 1 October 2021, ACM, 2021, pp. 521-529. URL: https://doi.org/10.1145/3460231.3475944. doi:10.1145/3460231.3475944.

[9] Di Palma, G. M. Biancofiore, V. W. Anelli, Narducci, T. D. Noia, E. D. Sciascio, Evaluating chatgpt as a recommender system: A rigorous approach, CoRR abs/2309.03613 (2023).

[10] Marler, Arora, Survey of multi-objective optimization methods for engineering, Structural and Multidisciplinary Optimization 26 (2004) 369-395. doi:10.1007/s00158-003-0368-6.

[11] Paparella, V. W. Anelli, F. M. Nardini, Perego, T. D. Noia, Post-hoc selection of pareto-optimal solutions in search and recommendation, in: I. Frommholz, F. Hopfgartner, M. Lee, M. Oakes, M. Lalmas, M. Zhang, R. L. T. Santos (Eds.), Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023, ACM, 2023, pp. 2013-2023. URL: https://doi.org/10.1145/3583780.3615010. doi:10.1145/3583780.3615010.

[12] Li, Yao, Quality evaluation of solution sets in multiobjective optimisation: A survey, ACM Computing Surveys (CSUR) 52 (2019) 1-38.

[13] Paparella, Pursuing optimal trade-off solutions in multi-objective recommender systems, in: J. Golbeck, F. M. Harper, V. Murdock, M. D. Ekstrand, B. Shapira, J. Basilico, K. T. Lundgaard, E. Oldridge (Eds.), RecSys '22: Sixteenth ACM Conference on Recommender Systems, Seattle, WA, USA, September 18-23, 2022, ACM, 2022, pp. 727-729. URL: https://doi.org/10.1145/3523227.3547425. doi:10.1145/3523227.3547425.

[14] Zitzler, Deb, L. Thiele, Comparison of multiobjective evolutionary algorithms: Empirical results, Evolutionary Computation 8 (2000) 173-195.

[15] J. R. Schott, Fault tolerant design using single and multicriteria genetic algorithm optimization, Technical Report, Air Force Inst of Tech Wright-Patterson AFB OH, 1995.

[16] D. A. Van Veldhuizen, Multiobjective evolutionary algorithms: classifications, analyses, and new innovations, Air Force Institute of Technology, 1999.

[17] Zitzler, L. Thiele, Multiobjective optimization using evolutionary algorithms-a comparative case study, in: Parallel Problem Solving from Nature-PPSN V: 5th International Conference Amsterdam, The Netherlands September 27-30, 1998 Proceedings 5, Springer, 1998, pp. 292-301.

[18] Wan, Misra, Nakashole, J. J. McAuley, Fine-grained spoiler detection from large-scale review corpora, in: A. Korhonen, D. R. Traum, L. Màrquez (Eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 2605-2610. URL: https://doi.org/10.18653/v1/p19-1248. doi:10.18653/v1/p19-1248.

[19] F. M. Harper, J. A. Konstan, The movielens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2016) 19:1-19:19. URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.

[20] Steck, Embarrassingly shallow autoencoders for sparse data, in: L. Liu, R. W. White, Mantrach, Silvestri, J. J. McAuley, Baeza-Yates, L. Zia (Eds.), The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, ACM, 2019, pp. 3251-3257. URL: https://doi.org/10.1145/3308558.3313710. doi:10.1145/3308558.3313710.

[21] Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: P. Champin, F. Gandon, M. Lalmas, P. G. Ipeirotis (Eds.), Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, ACM, 2018, pp. 689-698. URL: https://doi.org/10.1145/3178876.3186150. doi:10.1145/3178876.3186150.

[22] He, Deng, Wang, Li, Zhang, Wang, Lightgcn: Simplifying and powering graph convolution network for recommendation, in: J. X. Huang, Y. Chang, X. Cheng, J. Kamps, V. Murdock, J. Wen, Y. Liu (Eds.), Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, ACM, 2020, pp. 639-648. URL: https://doi.org/10.1145/3397271.3401063. doi:10.1145/3397271.3401063.

[23] Paudel, Christoffel, Newell, Bernstein, Updatable, accurate, diverse, and scalable recommendations for interactive applications, ACM Trans. Interact. Intell. Syst. 7 (2017) 1:1-1:34. URL: https://doi.org/10.1145/2955101. doi:10.1145/2955101.

[24] Resnick, Iacovou, Suchak, Bergstrom, Riedl, Grouplens: An open architecture for collaborative filtering of netnews, in: J. B. Smith, F. D. Smith, T. W. Malone (Eds.), CSCW '94, Proceedings of the Conference on Computer Supported Cooperative Work, Chapel Hill, NC, USA, October 22-26, 1994, ACM, 1994, pp. 175-186. URL: https://doi.org/10.1145/192844.192905. doi:10.1145/192844.192905.

[25] V. W. Anelli, Bellogín, Ferrara, Malitesta, F. A. Merra, Pomo, F. M. Donini, T. D. Noia, Elliot: A comprehensive and rigorous framework for reproducible recommender systems evaluation, in: F. Diaz, Shah, Suel, Castells, Jones, T. Sakai (Eds.), SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, ACM, 2021, pp. 2405-2414. URL: https://doi.org/10.1145/3404835.3463245. doi:10.1145/3404835.3463245.

[26] Jannach, Lerche, I. Kamehkhosh, Jugovac, What recommenders recommend: an analysis of recommendation biases and possible countermeasures, User Model. User Adapt. Interact. 25 (2015) 427-491. URL: https://doi.org/10.1007/s11257-015-9165-3. doi:10.1007/s11257-015-9165-3.

[27] Abdollahpouri, Burke, Mobasher, Managing popularity bias in recommender systems with personalized re-ranking, in: R. Barták, K. W. Brawner (Eds.), Proceedings of the Thirty-Second International Florida Artificial Intelligence Research Society Conference, Sarasota, Florida, USA, May 19-22 2019, AAAI Press, 2019, pp. 413-418. URL: https://aaai.org/ocs/index.php/FLAIRS/FLAIRS19/paper/view/18199.