=Paper=
{{Paper
|id=Vol-3924/short5
|storemode=property
|title=Addressing bias in Recommender Systems: A Case Study on Data Debiasing Techniques in Mobile Games
|pdfUrl=https://ceur-ws.org/Vol-3924/short5.pdf
|volume=Vol-3924
|authors=Yixiong Wang,Maria Paskevich,Hui Wang
|dblpUrl=https://dblp.org/rec/conf/robustrecsys/WangPW24
}}
==Addressing bias in Recommender Systems: A Case Study on Data Debiasing Techniques in Mobile Games==
Yixiong Wang†, Maria Paskevich∗,† and Hui Wang
King, Malmskillnadsgatan 19, 111 57 Stockholm, Sweden
Abstract
The mobile gaming industry, particularly the free-to-play sector, has been around for more than a decade, yet it still experiences rapid
growth. The concept of games-as-service requires game developers to pay much more attention to recommendations of content in their
games. Recommender systems (RS), however, bring with them the inevitable problem of bias in the data. Much research has addressed bias in RS for online retail and services, but far less is available for the specific case of the game industry. Moreover, previous works tested various debiasing techniques on explicit-feedback datasets, whereas mobile gaming data typically contains only implicit feedback. This case study aims to identify and categorize potential bias within datasets specific to
model-based recommendations in mobile games, review debiasing techniques in the existing literature, and assess their effectiveness on
real-world data gathered through implicit feedback. The effectiveness of these methods is then evaluated based on their debiasing
quality, data requirements, and computational demands.
Keywords
Recommender systems, In-game recommendation, Debiasing, Mobile games
RobustRecSys: Design, Evaluation, and Deployment of Robust Recommender Systems Workshop @ RecSys 2024, 18 October 2024, Bari, Italy.
∗ Corresponding author.
† These authors contributed equally.
wyx.ei.99@gmail.com (Y. Wang); maria.paskevich@king.com (M. Paskevich); maddy.hui.wang@king.com (H. Wang)
ORCID: 0000-0001-8904-2052 (Y. Wang); 0009-0006-6211-1824 (M. Paskevich); 0009-0004-4190-9410 (H. Wang)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

In the context of mobile gaming, the delivery of content to players through recommendations plays an important role. It can include elements such as in-game store products or certain parts of the game content. However, RSs used within this context are susceptible to bias due to (1) limited exposure: unlike in webshops (e.g. Amazon), available placements for sellable products in mobile games are often limited, and showing one product to a user means that alternatives are not displayed; and (2) the common approach of segmenting content through fixed heuristics before adopting an RS, which introduces biases into the training data and thereby influences the development of these models. Traditionally, at King we have addressed these biases either by training models on biased data, or by establishing holdout groups of users who receive random recommendations for a period of time in order to collect a uniform dataset that reflects user preference in an unbiased way. Although the second approach allows the collection of unbiased data, it can compromise user experience for a segment of players and may lead to significant operational costs and potential revenue losses. Previous studies have primarily focused on data derived from explicit feedback, where users rate items on a numerical scale, and various debiasing techniques are tested on such data. However, within the realm of mobile gaming, obtaining explicit feedback affects user experience, making it challenging to collect. As an alternative, data is often collected through implicit feedback [1], where user preferences are inferred from behaviors such as impressions, purchases, and other interactions. Given these challenges, our objectives in this study are: (1) to identify and categorize potential bias within our datasets; (2) to conduct a review of existing literature on debiasing techniques and assess their effectiveness on publicly available datasets; (3) to adapt and apply debiasing strategies, originally developed for explicit feedback data, to the implicit feedback data specific to King; and (4) to evaluate and compare the efficacy of different methods based on the quality of debiasing, data requirements, and computational complexity.

2. Related work

The existing literature on debiasing techniques in RS presents a well-structured and categorized list of methodologies [2][3]. It suggests that the selection of particular debiasing techniques should depend on the specific types of bias present in the data, as well as on the availability of unbiased data samples. In recommender systems for mobile games, various types of bias can arise, including but not limited to selection bias, exposure bias, position bias, and conformity bias. Among the relevant methods to debias the data in these cases is Inverse Propensity Scoring (IPS) [4], which deals with selection and exposure biases by weighting observations inversely to their selection probability, and does so without the need for unbiased data. Yet the method can result in high variance due to the challenges of accurately estimating propensities. Potential solutions to the high-variance issue of IPS include, for example, Doubly Robust (DR) learning [5], which constructs the loss function as a combination of an IPS-based model with an imputation-based model. The combination of the two models assures the doubly robustness property: the estimator remains accurate when either of the two components (propensity estimation or imputed data) is accurate. This method, though, relies on having an unbiased data sample to work. Another option is model-agnostic and bias-agnostic solutions like AutoDebias [6], which use meta-learning to dynamically assign weights within the RS, aiming to neutralize biases across the board. A potential benefit of such a solution is that it does not require knowing the types of bias present in the data; as a downside, it also relies on randomized samples, and the process of fitting multiple models makes training more computationally demanding. Despite the advances and variety of available debiasing techniques, applying recommender systems to mobile gaming content remains a relatively untapped
area, with most of the publications focusing on building recommendations [7][8][9], and not on issues of imbalance and bias. Previous efforts at King introduced DFSNet [10], an end-to-end model-specific debiasing technique that enables training an in-game recommender on an imbalanced dataset without randomized data. This work aims to enrich King's debiasing toolkit by exploring model-agnostic solutions, specifically focusing on the challenges of content recommendations within mobile games. However, the architecture of DFSNet is complex, involving multiple modules, which can make implementation and maintenance challenging. Moreover, it requires constant feedback loops over time, and the model's performance is highly dependent on the quality and recency of the training data.

3. Methodology

3.1. Datasets

Our study utilized two public datasets (COAT [4], yahooR3! [13]) to validate theoretical results, and three proprietary datasets from King (Set A, Set B, Set C) that focus on user-item interactions in game shops within Match-3 Game A and Match-3 Game B (Fig. 1). The sizes of each dataset, along with their respective feedback types, are provided in Table 1. We aimed to observe the effectiveness of different techniques on datasets collected with explicit feedback (the public datasets) and on those with implicit feedback (King's datasets). Explicit feedback is typically collected by asking users to rate items on a numerical scale, for example from 1 to 5, where 1 indicates disinterest, 2 signifies dissatisfaction, and 5 shows a preference. In contrast, implicit feedback (as in the proprietary datasets) involves a binary response from users: purchase or non-purchase. This setup makes it harder to accurately measure user preferences. As discussed in the Introduction, mobile games often have limited space for displaying sellable products, which is the case for all three proprietary datasets. This limitation leads to exposure bias in the data. Additionally, the placement of different products within the game shop creates position bias, with some items displayed in more appealing placements while others are not visible on the first screen (Fig. 1). Another bias, selection bias, arises from imbalanced product impressions, where certain items, such as conversion offers, are shown to users more frequently, resulting in significantly higher exposure for those items.

Table 1
The sizes and feedback types of all datasets used in this study. A key difference is that the open datasets (COAT and yahooR3!) provide explicit feedback, while the proprietary datasets (A, B, and C) offer only implicit feedback (purchase/no purchase). Set A, a proprietary dataset, lacks randomized data, limiting debiasing options.

Dataset    Biased samples   Unbiased samples   Feedback type
COAT       311k             54k                Explicit
yahooR3!   12.5k            75k                Explicit
Set A      47.6k            -                  Implicit
Set B      100k             218k               Implicit
Set C      980k             1.2mln             Implicit

Figure 1: Examples of content placements in Candy Crush Soda Saga (left) and Candy Crush Saga (right), highlighting biases: selection bias with a prominently placed product (left) and exposure bias with limited visibility, where products are hidden behind the "More Offers" button (right).

3.2. Selection of Debiasing techniques

The selection of debiasing techniques for this study was based on a literature review and on the applicability of each method to the specific biases present in the proprietary datasets, namely selection bias, exposure bias, and position bias. Further, it was important to evaluate techniques across two dimensions, those that require randomized datasets and those that do not, as well as to examine methodologies that are agnostic to any particular type of bias. Given the identified biases in the datasets, we adopted Matrix Factorisation (MF) as a baseline model and several debiasing techniques: (1) Inverse Propensity Scoring (IPS), a method that does not require randomized data collection and primarily addresses selection and exposure biases; (2) Doubly Robust (DR) learning, which tackles the same biases but, unlike IPS, requires a randomized dataset; and (3) AutoDebias, a bias-agnostic technique that also needs randomized data. Each method was tested across all datasets to evaluate model performance and complexity. We initially applied MF to the biased dataset D_T to establish metrics for comparison; we denote this baseline model MF(biased), and then compared its outcomes with the results from the debiasing methods.

3.3. Evaluation metrics

For model evaluation, we use metrics that assess the predictive power of the models (RMSE and AUC), the quality of ranking (NDCG@5), and the inequality and diversity of the recommendations (Gini index and Entropy):

• NDCG@5 assesses the model's ability to rank relevant items in the recommendation list:

NDCG@k = DCG@k / IDCG@k,   DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log_2(i + 1),

where IDCG@k is the ideal DCG@k and rel_i represents items ordered by their relevance up to position k.
Figure 2: Debiasing results on open datasets (COAT and yahooR3!). The graphs show the percentage change in metrics
(AUC, RMSE, NDCG@5, Gini, and Entropy) for various models relative to MF(biased). AUC is plotted against other metrics to
demonstrate the trade-off between diversity gains in recommendation systems and potential compromises in predictive power.
Different models are represented by colors, training times are indicated by point sizes, and dataset types are distinguished by
shapes.
Table 2
Percentage improvement of various models compared to MF(biased) across open datasets. The best results for each metric are
highlighted in bold.
Dataset    Model       RMSE      AUC       NDCG@5    Gini      Entropy   Training time
COAT       IPS         -2.53%    -0.26%    -1.18%    0.62%     -0.29%    8.82%
           DR          3.86%     -1.57%    2.75%     -18.88%   6.16%     194.12%
           AutoDebias  -5.06%    0.39%     3.73%     0.16%     0.00%     767.65%
yahooR3!   IPS         -29.70%   -0.55%    0.73%     -6.33%    0.82%     -22.98%
           DR          -30.39%   -0.83%    0.00%     1.22%     -0.12%    412.56%
           AutoDebias  -36.89%   1.79%     20.70%    -58.15%   4.26%     3215.87%
• RMSE measures the magnitude of prediction errors of exact rating predictions:

RMSE = √( (1/|R|) · Σ_{(u,i)∈R} (r̂_{ui} − r_{ui})² ),

where |R| denotes the total number of ratings in the dataset, and r̂_{ui} and r_{ui} are the predicted and true ratings for user-item pairs (u, i).

• AUC reflects how well the model distinguishes between positive and negative interactions:

AUC = ( Σ_{(u,i)∈D_te^+} rank_{u,i} − (|D_te^+| + 1)·|D_te^+| / 2 ) / ( |D_te^+| · (|D_te| − |D_te^+|) ),

where D_te^+ is the set of positive samples in the test set D_te, and rank_{u,i} denotes the position of a positive feedback (u, i). In experimentation, AUC mainly served as a metric to prevent overfitting and to help fine-tuning in the validation phase.

• Gini index measures inequality in the recommendation distribution; a higher coefficient indicates higher inequality:

G = ( Σ_{i=1}^{n} (2i − n − 1)·φ(i) ) / ( n · Σ_{i=1}^{n} φ(i) ),

where φ(i) is the popularity score of the i-th item, with the scores arranged in ascending order (φ(i) ≤ φ(i+1)), and n represents the total number of items.

• Entropy measures the diversity in the distribution of recommended items, with higher values indicating higher diversity:

Entropy = − Σ_{i=1}^{n} p_i log(p_i),

where n is the total number of items in a dataset and p_i is the probability of an item being recommended.

Additionally, we include Training Time, defined as the time required for each model to reach saturation, measured in seconds. This metric provides insight into the computational complexity and the resources required by the different methodologies.

4. Experimentation

We regard the biased data as the training set D_T. For the randomized data, following the strategies mentioned in [11], we split it into 3 parts: 5% for a randomized set D_U to assist training, as required by DR and AutoDebias; 5% for a validation set D_V to tune hyper-parameters and trigger an early-stopping mechanism that prevents overfitting; and the remaining 90% for a test set D_Te to evaluate the model. For consistency, this data split strategy is applied to both the open datasets and the proprietary datasets.

A training pipeline plays a pivotal role in operationalizing machine learning by automating each step from data fetching to model evaluation. For this project, we deployed a training pipeline on Vertex AI [12], integrating components such as data transformation powered by BigQuery, model training and evaluation, and experiment tracking. The pipeline retrieves data from the data warehouse to train models and produces artifacts that are later integrated into an experiment tracker. By adopting this artifact-based approach, we address the inherent challenge of reproducibility in operationalizing ML projects, as it provides all the necessary components to reproduce experiments. Each experiment is run up to 10 times on Vertex AI with the same hyper-parameters but varying random seeds to estimate the variability of the results. All experiments were conducted within this framework, ensuring consistency, efficiency, and precision throughout the development lifecycle.
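To make the evaluation metrics of Section 3.3 concrete, they can be sketched in plain Python with NumPy. This is an illustrative sketch, not the paper's implementation: the function names are ours, and the AUC sketch ignores ties in scores.

```python
import numpy as np

def rmse(r_hat, r):
    # Root mean squared error over all observed ratings
    return np.sqrt(np.mean((np.asarray(r_hat) - np.asarray(r)) ** 2))

def auc(scores, labels):
    # Rank-based AUC: ranks of positive samples among all test samples
    scores, labels = np.asarray(scores), np.asarray(labels)
    ranks = scores.argsort().argsort() + 1          # rank 1 = lowest score
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def dcg_at_k(rels, k):
    # DCG@k = sum over top-k of (2^rel - 1) / log2(position + 1)
    rels = np.asarray(rels, dtype=float)[:k]
    return ((2 ** rels - 1) / np.log2(np.arange(2, rels.size + 2))).sum()

def ndcg_at_k(rels, k=5):
    # Normalize by the ideal DCG (relevances sorted in descending order)
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def gini(popularity):
    # Inequality of the recommendation distribution; phi sorted ascending
    phi = np.sort(np.asarray(popularity, dtype=float))
    n = phi.size
    i = np.arange(1, n + 1)
    return ((2 * i - n - 1) * phi).sum() / (n * phi.sum())

def entropy(probs):
    # Diversity of recommended items; p_i are recommendation probabilities
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()
```

For example, a uniform popularity vector gives a Gini index of 0, and a uniform recommendation distribution over n items gives an entropy of log(n).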
Figure 3: Debiasing results on internal datasets (Set A, Set B and Set C). The graphs show the percentage change in metrics
(AUC, RMSE, NDCG@5, Gini, and Entropy) for various models relative to MF(biased). AUC is plotted against other metrics to
demonstrate the trade-off between diversity gains in recommendation systems and potential compromises in predictive power.
Different models are represented by colors, training times are indicated by point sizes, and dataset types are distinguished by
shapes.
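The 5%/5%/90% split of the randomized data described in Section 4 can be sketched as follows (an index-based sketch; the set names D_U, D_V, and D_Te follow the paper, while the function itself is ours):

```python
import numpy as np

def split_randomized(n_samples, seed=0):
    """Split randomized (unbiased) data as in [11]: 5% to aid training
    (D_U, needed by DR and AutoDebias), 5% for validation and early
    stopping (D_V), and the remaining 90% as the test set (D_Te)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_u = int(0.05 * n_samples)
    n_v = int(0.05 * n_samples)
    d_u = idx[:n_u]
    d_v = idx[n_u:n_u + n_v]
    d_te = idx[n_u + n_v:]
    return d_u, d_v, d_te
```

Varying `seed` across repeated runs mirrors the paper's protocol of re-running each experiment with different random seeds to estimate result variability.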
Table 3
Percentage improvement of various models compared to MF(biased) across internal datasets. The best results for each metric
are highlighted in bold.
Dataset    Model       RMSE      AUC       NDCG@5    Gini      Entropy   Training time
Set A      IPS         20.95%    -0.97%    -1.53%    -3.06%    0.41%     -4.72%
Set B      IPS         -8.61%    3.18%     -0.14%    3.29%     -0.02%    -12.23%
           DR          -45.40%   7.07%     0.68%     -0.54%    0.00%     386.46%
           AutoDebias  -26.46%   -1.25%    -0.48%    3.26%     -0.02%    -63.26%
Set C      IPS         39.01%    -23.46%   -29.36%   -9.47%    9.04%     -15.50%
           DR          7.74%     -13.76%   -28.44%   -5.36%    5.47%     14.74%
           AutoDebias  64.50%    2.61%     -0.01%    1.72%     -2.47%    233.93%
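As a concrete illustration of the IPS reweighting evaluated in these experiments, the sketch below trains a matrix factorization model with an inverse-propensity-weighted squared loss. It is a minimal sketch under simple assumptions, not King's production code: propensities are estimated from item popularity and clipped, which is one common remedy for the high-variance issue noted in Section 2.

```python
import numpy as np

def fit_mf_ips(interactions, n_users, n_items, dim=8,
               lr=0.05, reg=0.01, epochs=20, seed=0):
    """Matrix factorization trained with an IPS-weighted squared loss.

    interactions: list of (user, item, rating) observations.
    Each observation's gradient is weighted by 1 / propensity, where
    the propensity is naively estimated from item popularity.
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, dim))   # user factors
    Q = 0.1 * rng.standard_normal((n_items, dim))   # item factors

    # Naive propensity estimate: observation frequency per item,
    # clipped to bound the inverse weights (variance control).
    counts = np.bincount([i for _, i, _ in interactions], minlength=n_items)
    propensity = np.clip(counts / len(interactions), 1e-3, 1.0)

    for _ in range(epochs):
        for u, i, r in interactions:
            w = 1.0 / propensity[i]                 # inverse propensity weight
            err = r - P[u] @ Q[i]
            pu = P[u].copy()
            P[u] += lr * (w * err * Q[i] - reg * P[u])
            Q[i] += lr * (w * err * pu - reg * Q[i])
    return P, Q
```

Under-exposed items receive larger weights, so their few observations count more in the loss; the clipping threshold trades off bias correction against gradient variance.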
5. Experimentation results

The absolute results of all experiments, including confidence intervals, are presented in Table 4. In this section, we report the percentage improvement of various debiasing techniques compared to the baseline model trained on biased data (the MF(biased) model).

5.1. Open Datasets

For the COAT dataset, the results show varying degrees of improvement across different metrics (Table 2). The top performing method, AutoDebias, exhibited the best improvements in RMSE (-5.06%), AUC (0.39%) and NDCG@5 (3.73%), with small changes in Gini (0.16%) and no improvement in Entropy. DR also provided gains in NDCG@5 (2.75%) and performed better in Gini (-18.88%) and Entropy (6.16%), but at the cost of higher RMSE (3.86%) and lower AUC (-1.57%). While AutoDebias outperformed the other techniques in improving the predictive power of the model (AUC, RMSE), it was not very efficient in terms of Gini and Entropy, and it has a significantly higher computational cost. This highlights a trade-off between improved accuracy and increased resource requirements.

For the yahooR3! dataset, AutoDebias again achieves the highest improvements in RMSE (-36.89%), AUC (1.79%), NDCG@5 (20.70%), Gini (-58.15%) and Entropy (4.26%), but with dramatically increased computational cost (3215.87%). IPS provides balanced performance, with improvements in RMSE (-29.70%) and Entropy (0.82%) at a lower computational cost (-22.98%), making it a practical choice for resource-constrained environments.

5.2. Internal Datasets

For the internal datasets, the results are less consistent across datasets and debiasing techniques (Table 3). This may be because the internal datasets were collected through implicit feedback, where user preferences are inferred from impression and purchase records. This can introduce biases due to the lack of negative samples and the overrepresentation of user interactions, potentially skewing the models towards popular items.

Set A is a relatively small dataset (Table 1), and the lack of randomized data limits our options to IPS only. As a result, some metrics, such as RMSE and AUC, actually worsen (Table 3), which we might accept as a trade-off to achieve better balance in recommendations. However, NDCG@5 also does not improve. On the positive side, IPS enhances the diversity metrics, with Gini improving by 3.06% and Entropy by 0.41%, while also reducing computational cost by 4.72%. Overall, applying this method increases model diversity with comparable training time, but comes at the cost of accuracy.

Set B demonstrates substantial improvements with DR, including a 45.40% reduction in RMSE, a 7.07% increase in AUC, and gains in NDCG@5 (0.68%) and Gini (-0.54%), making the model perform better in both accuracy and diversity. However, this comes at a significant computational cost, increasing training time by 386.46%; given the total of 318k samples, this leads to a considerably longer training process. AutoDebias ranks second in RMSE improvement (-26.46%), while IPS shows a positive gain in AUC (3.18%). However, DR is the only method that consistently improves NDCG@5, Gini, and Entropy.

For Set C, the largest dataset with nearly 2.2 million samples, AutoDebias achieves the highest improvement in AUC (2.61%) and maintains stable NDCG@5. However, it underperforms compared to the baseline and other tech-
Table 4
Performance metrics across different models and datasets, with 95% confidence intervals.
Dataset Model RMSE AUC NDCG@5 Gini Entropy Training time (sec)
MF (uniform) 1.00 ± 0.02 0.54 ± 0.01 0.36 ± 0.02 0.64 ± 0.01 4.91 ± 0.02 2.00 ± 1.60
MF (biased) 0.75 ± 0.01 0.77 ± 0.01 0.51 ± 0.01 0.64 ± 0.04 4.9 ± 0.11 3.40 ± 1.00
COAT IPS 0.73 ± 0.01 0.76 ± 0.01 0.50 ± 0.01 0.65 ± 0.04 4.89 ± 0.10 3.70 ± 2.30
DR 0.78 ± 0.02 0.75 ± 0.01 0.52 ± 0.01 0.52 ± 0.01 5.20 ± 0.03 10.00 ± 6.90
AutoDebias 0.71 ± 0.01 0.77 ± 0.02 0.53 ± 0.01 0.64 ± 0.06 4.90 ± 0.14 29.50 ± 9.6
MF (uniform) 0.73 ± 0.01 0.57 ± 0.01 0.43 ± 0.01 0.41 ± 0.01 6.58 ± 0.01 4.80 ± 1.20
MF (biased) 0.86 ± 0.01 0.73 ± 0.01 0.55 ± 0.01 0.41 ± 0.01 6.58 ± 0.01 60.50 ± 12.20
yahooR3! IPS 0.61 ± 0.01 0.72 ± 0.01 0.55 ± 0.01 0.39 ± 0.01 6.63 ± 0.02 46.60 ± 16.10
DR 0.60 ± 0.04 0.72 ± 0.01 0.55 ± 0.01 0.42 ± 0.01 6.57 ± 0.01 310.10 ± 54.60
AutoDebias 0.54 ± 0.01 0.74 ± 0.01 0.66 ± 0.01 0.17 ± 0.01 6.86 ± 0.01 2006.10 ± 1541.00
MF (biased) 0.82 ± 0.07 0.54 ± 0.02 0.56 ± 0.02 0.36 ± 0.01 2.83 ± 0.01 694.30 ± 163.30
Set A IPS 0.99 ± 0.02 0.54 ± 0.01 0.55 ± 0.01 0.35 ± 0.02 2.84 ± 0.02 661.50 ± 85.90
MF (uniform) 0.61 ± 0.00 0.92 ± 0.01 0.97 ± 0.00 0.10 ± 0.00 1.77 ± 0.00 2891.00 ± 126.90
MF (biased) 0.81 ± 0.06 0.89 ± 0.00 0.97 ± 0.00 0.10 ± 0.00 1.80 ± 0.00 2123.90 ± 441.3
Set B IPS 0.74 ± 0.14 0.92 ± 0.01 0.97 ± 0.00 0.10 ± 0.00 1.77 ± 0.00 1864.10 ± 86.70
DR 0.44 ± 0.02 0.95 ± 0.01 0.96 ± 0.01 0.10 ± 0.01 1.77 ± 0.00 10332.00 ± 2486.30
AutoDebias 0.56 ± 0.02 0.88 ± 0.01 0.96 ± 0.01 0.10 ± 0.00 1.77 ± 0.00 780.30 ± 153.70
MF (uniform) 0.92 ± 0.04 0.25 ± 0.02 0.07 ± 0.01 0.52 ± 0.01 2.52 ± 0.02 775.90 ± 265.00
MF (biased) 0.62 ± 0.01 0.84 ± 0.01 0.80 ± 0.01 0.65 ± 0.01 2.18 ± 0.02 650.80 ± 114.70
Set C IPS 0.86 ± 0.06 0.64 ± 0.05 0.56 ± 0.08 0.59 ± 0.01 2.37 ± 0.02 549.90 ± 128.30
DR 0.67 ± 0.02 0.72 ± 0.05 0.57 ± 0.09 0.61 ± 0.02 2.29 ± 0.05 746.70 ± 140.00
AutoDebias 1.02 ± 0.03 0.86 ± 0.04 0.78 ± 0.02 0.66 ± 0.02 2.12 ± 0.04 2173.20 ± 1826.10
niques in RMSE, Gini, Entropy, and training time, the latter increasing significantly by 233.93%. IPS, on the other hand, delivers poor results in RMSE (39.01%), AUC (-23.46%), and NDCG@5 (-29.36%), but excels in Gini (-9.47%) and Entropy (9.04%) without adding to the training time.

6. Conclusion and Future work

Implementing more accurate and less biased models is crucial to avoiding the perpetuation of negative feedback loops and the overexposure of certain items caused by segmentation heuristics in retraining data. This approach also enhances data quality, which is essential for fine-tuning models. A recommender system that diversifies content exposure improves user experience by ensuring that visibility is not limited to only the most popular items. In our experiments, Inverse Propensity Scoring (IPS) stands out for its simplicity and model-agnostic nature, requiring no randomized data collection and fewer training epochs; however, the improvements it offers are somewhat limited. AutoDebias excels at improving accuracy metrics, but at substantially higher computational cost and with sometimes poorer performance in Gini and Entropy. DR still offers strong improvement in the observed metrics, including Gini and Entropy. So while each debiasing method has its own trade-offs, significant performance gains still depend on the challenging task of collecting randomized datasets, as highlighted in our introduction. Potential future work includes: (1) adopting online reinforcement learning approaches such as Multi-Armed Bandits (MAB) [14, 15, 16] for data collection, including contextual bandit models, and (2) developing and testing combined debiasing models which can combine the strengths of different debiasing techniques to mitigate various biases simultaneously while optimizing for computational efficiency.

References

[1] Douglas W. Oard and Jinmook Kim, Implicit feedback for recommender systems. In Proceedings of the AAAI Workshop on Recommender Systems, Vol. 83, 1998.
[2] Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He, Bias and Debias in Recommender System: A Survey and Future Directions. ACM Trans. Inf. Syst. 41, 3, Article 67 (2023).
[3] Harald Steck, Training and testing of recommender systems on data missing not at random. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2010). Association for Computing Machinery, New York, NY, USA, 713–722.
[4] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims, Recommendations as Treatments: Debiasing Learning and Evaluation. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016). JMLR.org, 1670–1679.
[5] Quanyu Dai, Haoxuan Li, Peng Wu, Zhenhua Dong, Xiao-Hua Zhou, Rui Zhang, Rui Zhang, and Jie Sun, A Generalized Doubly Robust Learning Framework for Debiasing Post-Click Conversion Rate Prediction. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2022). Association for Computing Machinery, New York, NY, USA, 252–262.
[6] Jiawei Chen, Hande Dong, Yang Qiu, Xiangnan He, Xin Xin, Liang Chen, Guli Lin, and Keping Yang, AutoDebias: Learning to Debias for Recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021). Association for Computing Machinery, New York, NY, USA, 21–30.
[7] Andrés Villa, Vladimir Araujo, Francisca Cattan, and Denis Parra, Interpretable Contextual Team-aware Item Recommendation: Application in Multiplayer Online Battle Arena Games. In Proceedings of the 14th ACM Conference on Recommender Systems (RecSys 2020).
[8] Qilin Deng, Kai Wang, Minghao Zhao, Zhene Zou, Runze Wu, Jianrong Tao, Changjie Fan, and Liang Chen, Personalized Bundle Recommendation in Online Games. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM 2020).
[9] Meng Wu, John Kolen, Navid Aghdaie, and Kazi A. Zaman, Recommendation Applications and Systems at Electronic Arts. In Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys 2017).
[10] Lele Cao, Sahar Asadi, Matteo Biasielli, and Michael Sjöberg, Debiasing Few-Shot Recommendation in Mobile Games. Workshop of the ACM Conference on Recommender Systems (RecSys 2020).
[11] Dugang Liu, Pengxiang Cheng, Zhenhua Dong, Xiuqiang He, Weike Pan, and Zhong Ming, A general knowledge distillation framework for counterfactual recommendation via uniform data. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020).
[12] Google, Vertex AI. Retrieved December 1, 2023 from https://cloud.google.com/vertex-ai.
[13] Benjamin M. Marlin and Richard S. Zemel, Collaborative prediction and ranking with non-random missing data. In Proceedings of the Third ACM Conference on Recommender Systems (RecSys 2009).
[14] Crícia Z. Felício, Klérisson V. R. Paixão, Celia A. Z. Barcelos, and Philippe Preux, A multi-armed bandit model selection for cold-start user recommendation. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, 32–40, 2017.
[15] Lu Wang, Chengyu Wang, Keqiang Wang, and Xiaofeng He, BiUCB: A contextual bandit algorithm for cold-start and diversified recommendation. In 2017 IEEE International Conference on Big Knowledge (ICBK), 248–253. IEEE, 2017.
[16] Qing Wang, Chunqiu Zeng, Wubai Zhou, Tao Li, S. Sitharama Iyengar, Larisa Shwartz, and Genady Ya. Grabarnik, Online interactive collaborative filtering using multi-armed bandit with dependent arms. IEEE Transactions on Knowledge and Data Engineering 31, 8 (2018), 1569–1580.