Diversity enhancement for collaborative filtering recommendation Liu Yankai1,∗ 1 China Mobile Research Institute, 32 Xuanwumen West Street, Beijing, 100000, China Abstract To evaluate the user experience of recommendation systems in realistic and complex scenarios, the EvalRS challenge evaluates recommendation algorithms from multiple perspectives such as fairness, diversity. This paper details the diversity-enhanced collaborative filtering recommendation algorithm that won first place in the EvalRS challenge. Our proposed solution has two essential innovations. First, the importance of the user’s historical behavior is ranked so as to obtain a high-ranking performance using fewer user behaviors. Second, the recommendation results are re-ranked to enhance the diversity of the recommendation results. In addition, this paper proposes a new evaluation metric, the quantile-based fairness Gini coefficient, to metric the fairness of the recommendation results, as it does not cause drastic fluctuations due to the small number of item interactions. Keywords collaborative filtering, recommendation evaluation, recommendation fairness 1. Introduction 2. The Challenge Conventional recommendation algorithm evaluation met- EvalRS is one of the challenges of CIKM 2022 AnalytiCup, rics are usually ranking performance metrics such as hit which is based on the reclist[2] and aims to evaluate rec- rate, ndcg, MRR, etc., which may lead to recommendation ommender systems in terms of key dimensions such as systems considering only certain user preferences and ig- diversity and fairness. The dataset of this challenge is noring multiple aspects of user experience. Fairness and based on LFM-1b[3], a music dataset with about 820,000 diversity are as important as ranking performance met- songs, 110,000 users, and about 37 million user interac- rics. If users frequently interact with a single type of item, tions.The goal of the challenge is to evaluate the recom- the user experience of the recommendation system will mendation system in multiple aspects, including Stan- be damaged in the long run, which will eventually lead dard recommendation system metrics, Standard metrics to user churn. Therefore, the recommendation algorithm on a per-group or slice basis, Behavioral and qualitative model needs to be evaluated from multiple perspectives. tests, etc.[1] This paper illustrates the solution of EvalRS challenge[1], including a collaborative filtering algorithm based on fre- Table 1 quent item mining, importance ranking of user behavior Results of various experiments based on TF-IDF, and diversity-enhanced re-ranking al- gorithm. The next section presents a brief introduction to Tests exp1 exp2 exp3 this challenge, and section 3 details the solution and the TRACK_POPULARITY_GINI 0.0046 0.0008 0.0008 fairness index based on the Gini coefficient; finally, the HIT_RATE 0.0330 0.0154 0.0154 section 4 shows the experimental results. Furthermore, MRR 0.0106 0.0066 0.0058 the specific solution code and documentation are pub- MRED_COUNTRY -0.0068 -0.0040 -0.0040 licly available on GitHub: https://github.com/lazy2pan- MRED_USER_ACTIVITY -0.0085 -0.0069 -0.0069 da/cikm2022_solution. MRED_TRACK_POPULARITY -0.0282 -0.0020 -0.0020 MRED_ARTIST_POPULARITY -0.0134 -0.0016 -0.0016 MRED_GENDER -0.0027 -0.0009 -0.0009 BEING_LESS_WRONG 0.3363 0.3240 0.4248 LATENT_DIVERSITY -0.1995 -0.2268 -0.1216 AGGREGATE_SCORE -3.7511 1.3990 1.7025 CIKM’22: Proceedings of the 31st ACM International Conference on Information and Knowledge Management ∗ Corresponding author. Envelope-Open liuyankai@chinamobile.com (L. Yankai) Orcid 0000-0001-7159-4975 (L. Yankai) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 3. Solution mendation. The Gini coefficient originates from the field of economics and is often used to assess the degree of 3.1. Model Architecture fairness of income distribution of residents. This paper uses the Gini coefficient to assess the degree of variation The main model used in this paper is the n-gram model, in the accuracy of items across quartile intervals. Com- which is mainly applied in the field of natural language pared to the standard deviation, the Gini coefficient is processing and is a statistical-based algorithm.[4] Its ba- a more accurate reflection of the difference in fairness sic idea is that the content inside the text is operated between two pairs of items across quartile intervals. The with a sliding window of size 𝑁, forming a sequence of specific calculation is, firstly, dividing the trackid into segments of length 𝑁. This method first sorts the user multiple quantile intervals according to the popularity, history sequence by time denote as 𝐻𝑢 and uses the n- secondly, calculating the false positive rate (FPR) of the gram algorithm to process each user history sequence, trackid in different quantile intervals, and finally,the Gini where we take 𝑁 in n-gram as 2; that is, each user history coefficient is calculated as follows: sequence is cut into multiple subsequences of length 2, then the whole training set is 2-gram sequence sliced, and 𝑛 𝑛 ∑𝑖=1 ∑𝑗=1 |𝐹 𝑃𝑅𝑖 − 𝐹 𝑃𝑅𝑗 | calculate the frequency of all trackid pairs; for trackid 𝐺= (3) ̄ 2 ∗ 𝐹 𝑃𝑅 pair (𝑖, 𝑗), its frequency is denoted as 𝐹 (𝑖, 𝑗); for track 𝑖, the track with the highest co-occurrence frequency is denoted as 𝑆(𝑖). Next, based on the user’s history 𝐻𝑢 , the 4. Experiments similar track 𝑆(𝑖) of each user’s history is obtained, and The experiments part shows the metrics obtained after then the recommendation result of the user is obtained several iterations. As shown in Table 1, the results of by ranking all similar items according to their frequency. Experiment 1 indicate that HIT_RATE and MRR are the In order to reduce the popularity fairness of track in rec- highest, fairness and diversity metrics are the lowest, ommendation results, we use the TF-IDF value of track and the final AGGREGATE_SCORE is the lowest when to sort the user history sequence and truncate the user no samples are performed on user history sequences. history records, where TF-IDF is calculated as follows, 𝑢 Experiment 2 performs TF-IDF sorting on user history for track 𝑡, where 𝐶𝑡 is the number of plays of 𝑡 in user 𝑢, 𝑎𝑙𝑙 sequences and uses only the top 8 tracks for inference. and 𝐶𝑡 is the number of plays of 𝑡 in all users, where 𝐺 The results show that the fairness metrics are signifi- denotes the total number of users and 𝐺(𝑡) denotes the cantly improved compared to Experiment 1, with a 94% number of users who have interacted with 𝑡. 𝑢 improvement in the MRED_TRACK_POPULARITY met- 𝐶𝑡 𝐺 𝑇 𝐹 − 𝐼 𝐷𝐹 (𝑡) = ∗ (𝑙𝑜𝑔 + 1) (1) ric. Based on Experiment 2, Experiment 3 conducted 𝐶𝑡𝑎𝑙𝑙 𝐺(𝑡) a diversity enhancement ranking. Compared with Ex- Finally, diversity-enhanced reranking is performed on the periment 2, Behavioral and qualitative tests were signifi- basis of the above recommendation results. The user’s cantly improved, with BEING_LESS_WRONG improved recommendation results are denoted as 𝑅𝑢 , and the user’s by 31% and LATENT_DIVERSITY improved by 46%, and history is recorded as 𝐻𝑢 . The EvalRS challenge for di- the others metrics remained the same as in Experiment versity is defined as 𝐷(𝑅𝑢 ) = 0.3 ∗ 𝑑𝑖𝑣𝑒𝑟𝑠𝑖𝑡𝑦 − 0.7 ∗ 𝑏𝑖𝑎𝑠, 2. The AGGREGATE_SCORE of Experiment 3 is 1.7025. where 𝑑𝑖𝑣𝑒𝑟𝑠𝑖𝑡𝑦 is defined as the sum of the differences In addition, TRACK_POPULARITY_GINI is our custom between each point in the prediction space and the mean test metric whose value can reflect the fairness of track of the prediction space, and 𝑏𝑖𝑎𝑠 is defined as the dis- popularity. Finally, since the competition requires that tance between the ground truth vector and the mean of the submission must be run on the AWS EC2 p3.2xlarge the prediction vector.[1] In this method, we combine the instance within 90 minutes, with the support of JIUTIAN 1 MMR[5] diversity algorithm with the diversity definition Artificial Intelligence Platform , we completed the exper- of EvalRS and use the vector mean of 𝐻𝑢 as the ground iments and performance tests. truth vector to calculate the recommended result 𝑅𝑢 as follows: 5. Conclusion 𝑀 = arg max [0.3 ∗ diversity([𝑆, 𝑡𝑖 ])) − 0.7 ∗ bias([𝑆, 𝑡𝑖 ]))] 𝑡𝑖 ∈𝑅𝑢 ∣𝑆 This paper details the EvalRS challenge’s solution, which (2) is based on the n-gram collaborative filtering algorithm and uses TF-IDF to rank the importance of users’ history 3.2. Quantile-based Gini coefficient records, followed by re-ranking for diversity, and finally fairness test achieves better results in aggregation metrics. Based on the Gini coefficient, this paper proposes a new test to calculate the fairness of track popularity recom- 1 https://jiutian.10086.cn References [1] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, Evalrs: a rounded evaluation of recommender systems, 2022. URL: https://arxiv.org/abs/2207.05772. doi:10. 48550/ARXIV.2207.05772 . [2] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Be- yond NDCG: Behavioral testing of recommender systems with RecList, in: Companion Proceed- ings of the Web Conference 2022, ACM, 2022. URL: https://doi.org/10.1145%2F3487553.3524215. doi:10. 1145/3487553.3524215 . [3] M. Schedl, The lfm-1b dataset for music retrieval and recommendation, in: Proceedings of the 2016 ACM on international conference on multimedia retrieval, 2016, pp. 103–110. [4] W. B. Cavnar, J. M. Trenkle, et al., N-gram-based text categorization, in: Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, volume 161175, Citeseer, 1994. [5] J. Carbonell, J. Goldstein, The use of mmr, diversity- based reranking for reordering documents and pro- ducing summaries, in: Proceedings of the 21st an- nual international ACM SIGIR conference on Re- search and development in information retrieval, 1998, pp. 335–336.