Diversity enhancement for collaborative filtering
recommendation
Liu Yankai1,∗
1
    China Mobile Research Institute, 32 Xuanwumen West Street, Beijing, 100000, China


                                             Abstract
                                             To evaluate the user experience of recommendation systems in realistic and complex scenarios, the EvalRS challenge evaluates
                                             recommendation algorithms from multiple perspectives such as fairness, diversity. This paper details the diversity-enhanced
                                             collaborative filtering recommendation algorithm that won first place in the EvalRS challenge. Our proposed solution has
                                             two essential innovations. First, the importance of the user’s historical behavior is ranked so as to obtain a high-ranking
                                             performance using fewer user behaviors. Second, the recommendation results are re-ranked to enhance the diversity of the
                                             recommendation results. In addition, this paper proposes a new evaluation metric, the quantile-based fairness Gini coefficient,
                                             to metric the fairness of the recommendation results, as it does not cause drastic fluctuations due to the small number of item
                                             interactions.

                                             Keywords
                                             collaborative filtering, recommendation evaluation, recommendation fairness


1. Introduction                                                                                                                       2. The Challenge
Conventional recommendation algorithm evaluation met-                                                                                 EvalRS is one of the challenges of CIKM 2022 AnalytiCup,
rics are usually ranking performance metrics such as hit                                                                              which is based on the reclist[2] and aims to evaluate rec-
rate, ndcg, MRR, etc., which may lead to recommendation                                                                               ommender systems in terms of key dimensions such as
systems considering only certain user preferences and ig-                                                                             diversity and fairness. The dataset of this challenge is
noring multiple aspects of user experience. Fairness and                                                                              based on LFM-1b[3], a music dataset with about 820,000
diversity are as important as ranking performance met-                                                                                songs, 110,000 users, and about 37 million user interac-
rics. If users frequently interact with a single type of item,                                                                        tions.The goal of the challenge is to evaluate the recom-
the user experience of the recommendation system will                                                                                 mendation system in multiple aspects, including Stan-
be damaged in the long run, which will eventually lead                                                                                dard recommendation system metrics, Standard metrics
to user churn. Therefore, the recommendation algorithm                                                                                on a per-group or slice basis, Behavioral and qualitative
model needs to be evaluated from multiple perspectives.                                                                               tests, etc.[1]
This paper illustrates the solution of EvalRS challenge[1],
including a collaborative filtering algorithm based on fre-                                                                           Table 1
quent item mining, importance ranking of user behavior                                                                                Results of various experiments
based on TF-IDF, and diversity-enhanced re-ranking al-
gorithm. The next section presents a brief introduction to                                                                                          Tests                exp1       exp2     exp3
this challenge, and section 3 details the solution and the                                                                             TRACK_POPULARITY_GINI             0.0046    0.0008    0.0008
fairness index based on the Gini coefficient; finally, the                                                                                    HIT_RATE                  0.0330     0.0154    0.0154
section 4 shows the experimental results. Furthermore,                                                                                          MRR                     0.0106     0.0066    0.0058
the specific solution code and documentation are pub-                                                                                      MRED_COUNTRY                 -0.0068   -0.0040    -0.0040
licly available on GitHub: https://github.com/lazy2pan-                                                                                 MRED_USER_ACTIVITY              -0.0085   -0.0069    -0.0069
da/cikm2022_solution.                                                                                                                  MRED_TRACK_POPULARITY            -0.0282   -0.0020    -0.0020
                                                                                                                                       MRED_ARTIST_POPULARITY           -0.0134   -0.0016     -0.0016
                                                                                                                                            MRED_GENDER                 -0.0027   -0.0009    -0.0009
                                                                                                                                         BEING_LESS_WRONG                0.3363    0.3240    0.4248
                                                                                                                                          LATENT_DIVERSITY              -0.1995   -0.2268    -0.1216
                                                                                                                                          AGGREGATE_SCORE               -3.7511    1.3990    1.7025

CIKM’22: Proceedings of the 31st ACM International Conference on
Information and Knowledge Management
∗
     Corresponding author.
Envelope-Open liuyankai@chinamobile.com (L. Yankai)
Orcid 0000-0001-7159-4975 (L. Yankai)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
                                       Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
3. Solution                                                     mendation. The Gini coefficient originates from the field
                                                                of economics and is often used to assess the degree of
3.1. Model Architecture                                         fairness of income distribution of residents. This paper
                                                                uses the Gini coefficient to assess the degree of variation
The main model used in this paper is the n-gram model,
                                                                in the accuracy of items across quartile intervals. Com-
which is mainly applied in the field of natural language
                                                                pared to the standard deviation, the Gini coefficient is
processing and is a statistical-based algorithm.[4] Its ba-
                                                                a more accurate reflection of the difference in fairness
sic idea is that the content inside the text is operated
                                                                between two pairs of items across quartile intervals. The
with a sliding window of size 𝑁, forming a sequence of
                                                                specific calculation is, firstly, dividing the trackid into
segments of length 𝑁. This method first sorts the user
                                                                multiple quantile intervals according to the popularity,
history sequence by time denote as 𝐻𝑢 and uses the n-
                                                                secondly, calculating the false positive rate (FPR) of the
gram algorithm to process each user history sequence,
                                                                trackid in different quantile intervals, and finally,the Gini
where we take 𝑁 in n-gram as 2; that is, each user history
                                                                coefficient is calculated as follows:
sequence is cut into multiple subsequences of length 2,
then the whole training set is 2-gram sequence sliced, and                            𝑛     𝑛
                                                                                    ∑𝑖=1 ∑𝑗=1 |𝐹 𝑃𝑅𝑖 − 𝐹 𝑃𝑅𝑗 |
calculate the frequency of all trackid pairs; for trackid                      𝐺=                                         (3)
                                                                                                     ̄
                                                                                              2 ∗ 𝐹 𝑃𝑅
pair (𝑖, 𝑗), its frequency is denoted as 𝐹 (𝑖, 𝑗); for track 𝑖,
the track with the highest co-occurrence frequency is
denoted as 𝑆(𝑖). Next, based on the user’s history 𝐻𝑢 , the 4. Experiments
similar track 𝑆(𝑖) of each user’s history is obtained, and
                                                                The experiments part shows the metrics obtained after
then the recommendation result of the user is obtained
                                                                several iterations. As shown in Table 1, the results of
by ranking all similar items according to their frequency.
                                                                Experiment 1 indicate that HIT_RATE and MRR are the
In order to reduce the popularity fairness of track in rec-
                                                                highest, fairness and diversity metrics are the lowest,
ommendation results, we use the TF-IDF value of track
                                                                and the final AGGREGATE_SCORE is the lowest when
to sort the user history sequence and truncate the user
                                                                no samples are performed on user history sequences.
history records, where TF-IDF is calculated as follows,
                      𝑢                                         Experiment 2 performs TF-IDF sorting on user history
for track 𝑡, where 𝐶𝑡 is the number of plays of 𝑡 in user 𝑢,
       𝑎𝑙𝑙                                                      sequences and uses only the top 8 tracks for inference.
and 𝐶𝑡 is the number of plays of 𝑡 in all users, where 𝐺
                                                                The results show that the fairness metrics are signifi-
denotes the total number of users and 𝐺(𝑡) denotes the
                                                                cantly improved compared to Experiment 1, with a 94%
number of users who have interacted with 𝑡.
                                𝑢                               improvement in the MRED_TRACK_POPULARITY met-
                              𝐶𝑡            𝐺
             𝑇 𝐹 − 𝐼 𝐷𝐹 (𝑡) =       ∗ (𝑙𝑜𝑔      + 1)       (1) ric. Based on Experiment 2, Experiment 3 conducted
                              𝐶𝑡𝑎𝑙𝑙        𝐺(𝑡)                 a diversity enhancement ranking. Compared with Ex-
Finally, diversity-enhanced reranking is performed on the periment 2, Behavioral and qualitative tests were signifi-
basis of the above recommendation results. The user’s cantly improved, with BEING_LESS_WRONG improved
recommendation results are denoted as 𝑅𝑢 , and the user’s by 31% and LATENT_DIVERSITY improved by 46%, and
history is recorded as 𝐻𝑢 . The EvalRS challenge for di- the others metrics remained the same as in Experiment
versity is defined as 𝐷(𝑅𝑢 ) = 0.3 ∗ 𝑑𝑖𝑣𝑒𝑟𝑠𝑖𝑡𝑦 − 0.7 ∗ 𝑏𝑖𝑎𝑠, 2. The AGGREGATE_SCORE of Experiment 3 is 1.7025.
where 𝑑𝑖𝑣𝑒𝑟𝑠𝑖𝑡𝑦 is defined as the sum of the differences In addition, TRACK_POPULARITY_GINI is our custom
between each point in the prediction space and the mean test metric whose value can reflect the fairness of track
of the prediction space, and 𝑏𝑖𝑎𝑠 is defined as the dis- popularity. Finally, since the competition requires that
tance between the ground truth vector and the mean of the submission must be run on the AWS EC2 p3.2xlarge
the prediction vector.[1] In this method, we combine the instance within 90 minutes, with the support of JIUTIAN
                                                                                                  1
MMR[5] diversity algorithm with the diversity definition Artificial Intelligence Platform , we completed the exper-
of EvalRS and use the vector mean of 𝐻𝑢 as the ground           iments  and  performance   tests.
truth vector to calculate the recommended result 𝑅𝑢 as
follows:
                                                                   5. Conclusion
𝑀 = arg max [0.3 ∗ diversity([𝑆, 𝑡𝑖 ])) − 0.7 ∗ bias([𝑆, 𝑡𝑖 ]))]
         𝑡𝑖 ∈𝑅𝑢 ∣𝑆                                                 This paper details the EvalRS challenge’s solution, which
                                                            (2)
                                                                   is based on the n-gram collaborative filtering algorithm
                                                                   and uses TF-IDF to rank the importance of users’ history
3.2. Quantile-based Gini coefficient                               records, followed by re-ranking for diversity, and finally
     fairness test                                                 achieves better results in aggregation metrics.
Based on the Gini coefficient, this paper proposes a new
test to calculate the fairness of track popularity recom-          1
                                                                       https://jiutian.10086.cn
References
[1] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio,
    C. Greco, G. d. S. P. Moreira, P. J. Chia, Evalrs:
    a rounded evaluation of recommender systems,
    2022. URL: https://arxiv.org/abs/2207.05772. doi:10.
    48550/ARXIV.2207.05772 .
[2] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Be-
    yond NDCG: Behavioral testing of recommender
    systems with RecList, in: Companion Proceed-
    ings of the Web Conference 2022, ACM, 2022. URL:
    https://doi.org/10.1145%2F3487553.3524215. doi:10.
    1145/3487553.3524215 .
[3] M. Schedl, The lfm-1b dataset for music retrieval and
    recommendation, in: Proceedings of the 2016 ACM
    on international conference on multimedia retrieval,
    2016, pp. 103–110.
[4] W. B. Cavnar, J. M. Trenkle, et al., N-gram-based
    text categorization, in: Proceedings of SDAIR-94,
    3rd annual symposium on document analysis and
    information retrieval, volume 161175, Citeseer, 1994.
[5] J. Carbonell, J. Goldstein, The use of mmr, diversity-
    based reranking for reordering documents and pro-
    ducing summaries, in: Proceedings of the 21st an-
    nual international ACM SIGIR conference on Re-
    search and development in information retrieval,
    1998, pp. 335–336.