Debiasing Few-Shot Recommendation in Mobile Games

Lele Cao (caolele@gmail.com), Sahar Asadi (sahar.asadi@king.com), Matteo Biasielli (matteo.biasielli@king.com), Michael Sjöberg (michael.sjoberg@king.com)
AI R&D, King Digital Entertainment, Activision Blizzard Group, Stockholm, Sweden
ABSTRACT

Mobile gaming has become increasingly popular due to the growing usage of smartphones in day-to-day life. In recent years, this advancement has led to an interest in the application of in-game recommendation systems. However, in-game recommendation is more challenging than common recommendation scenarios, such as e-commerce, for a number of reasons: (1) the player behavior and context change at a fast pace, (2) only a few items (few-shot) can be exposed, and (3) with an existing hand-crafted heuristic recommendation, performing randomized explorations to collect data is not a business choice preferred by game stakeholders. To that end, we propose an end-to-end model called DFSNet (Debiasing Few-Shot Network) that enables training an in-game recommender on an imbalanced dataset that is biased by the existing heuristic policy. We experimentally evaluate the performance of DFSNet both offline on a validation dataset and online in a real-time serving environment, illustrating the correctness and effectiveness of the trained model.

CCS CONCEPTS

• Information systems → Recommender systems; • Computing methodologies → Neural networks.

KEYWORDS

In-game recommendation, debiasing, mobile game, feedback loop, few-shot recommendation, A/B test

Reference Format:
Lele Cao, Sahar Asadi, Matteo Biasielli, and Michael Sjöberg. 2020. Debiasing Few-Shot Recommendation in Mobile Games. In 3rd Workshop on Online Recommender Systems and User Modeling (ORSUM 2020), in conjunction with the 14th ACM Conference on Recommender Systems, September 25th, 2020, Virtual Event, Brazil.

ORSUM@ACM RecSys 2020, September 25th, 2020, Virtual Event, Brazil
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

As smartphones expand the gaming market [21], mobile gaming has become a significant segment of the video game industry. Although recommendation systems such as [12] and [25] are widely adopted in e-commerce, their integration with mobile games is a relatively new area of research. Previous works have mostly focused on recommending game titles to potential players (e.g., [2], [13], [14], [23], and [24]). A few recent works have also explored in-game recommendation [3, 5, 8, 19]; however, to the best of our knowledge, large-scale and real-time recommendation of in-game items has not reached maturity in industrial scenarios. One of the common business models in modern mobile games is free-to-play, where the game can be played free of charge and monetization occurs through micro-transactions of additional content and in-game items [1]. Therefore, in-game content is continuously added to the game, which may easily overwhelm the players, causing an increase in churn probability. In-game item recommendation systems help to alleviate this problem by ranking items and selecting the ones that are more relevant to players in order to improve player engagement.

In-game recommender systems utilize user interaction data that describes the historical behavior and current context of individual players to expose each player to the right item at the right time. However, despite a few in-game recommendation trials [3, 5, 8, 19], evaluated mostly in an offline and batch fashion, there have not been many successful industrial applications of online in-game recommendation systems. This is mainly attributed to three unique requirements from mobile games:

(1) The recommendation is often calculated on remote servers and delivered to game clients in near-real-time with low latency (e.g., within the range of 100 milliseconds). Because of the fast-evolving game dynamics, the behavior of players and their context keep changing quickly; consequently, the recommendations (calculated from behavior and context data) become outdated easily. As a result, the optimal solution should continuously perform recommendation calculation and always deliver up-to-date predictions when item exposure is triggered. That is why offline batch recommendation might only provide a suboptimal average policy.

(2) In mobile games, the items to purchase or to play are usually carefully crafted by game designers. To avoid distracting the players with an overloaded small mobile screen, only a minimal subset (e.g., as small as one to three items, hence termed few-shot) of those items is displayed at each exposure occasion. Therefore, the players' experience and behavior will be more sensitive to recommendations than in e-commerce applications, where a large number of items can be displayed at a time, leading to a stronger direct feedback loop [18].
Figure 1: Examples of scenarios where item recommendation can be applied in CCS (1a) and CCSS (1b): (a) LiveOps (dynamic content); (b) daily gifts.

Figure 2: Illustration of collecting features and label for one player and one exposure trigger.
(3) The carefully designed in-game items are often exposed to players following a pre-defined heuristic policy that contains a set of hard-coded rules regulating the particular item(s) to be exposed to player group(s) with certain attributes (e.g., IF a player has won more than z games in a day, THEN show item A instead of item B). Recommendation models largely fall into two main categories: Supervised Learning [10] and Reinforcement Learning [7], both of which work only when item exposure can be randomly explored. However, the existing heuristic policy heavily biases the experience of the players and hence the dataset, which makes it extremely difficult to train an unbiased model directly. Collecting randomized data is often not trivial: in many cases, stakeholders prefer to continue working with reasonably good heuristics, which might not be optimal but avoid potential business risks caused by randomization.

Our literature survey (up to the date this paper was written) shows that none of the related works [3, 5, 8, 19] managed to simultaneously address the three aforementioned challenges. The contributions of this paper are threefold: (1) we propose a Debiasing Few-Shot Network (DFSNet) that enables training an in-game item recommender merely using heavily biased and imbalanced data, (2) we discuss an approach to benchmark the trained DFSNet offline, and (3) we put the model live to recommend items in real-time, and demonstrate how to monitor, evaluate, interpret, and iterate on DFSNet in a controlled A/B test framework.

2 THE PROPOSED APPROACH

There are many scenarios where in-game recommendations could be applied. In Figure 1, we show a couple of examples from two of the King1 games: Candy Crush Soda Saga (CCSS) and Candy Crush Saga (CCS). We notice that some occasions allow only one item (a.k.a. one-shot) to be exposed at a time, such as Figure 1b, while others (e.g., Figure 1a) can display a few more (a.k.a. few-shot) items. Items can have no values specified, as shown in these two examples, or have values attached. To simplify the introduction of our method and experiments, we use the one-shot setup where only one item k with value v_k can be recommended upon each trigger of an exposure opportunity. We will show that our approach can be easily applied to scenarios with few-shot exposure and items with no values. The overall optimization objective is to maximize the expected value of the potentially clicked items. In this section, we present a walk-through of our debiasing few-shot recommendation approach.

2.1 Features and Label

Each sample in the dataset corresponds to a complete item exposure event triggered at time t for a player. As illustrated in Figure 2, we calculate the player features, noted as x ∈ R^D, using historical data of the last N days before the time t. The D-dimensional features fall into two categories: behavioral (e.g., the total number of game rounds played) and contextual (e.g., the latest inventory status). In addition, at time t, the exposed item k (following a heuristic policy) is recorded. Within the time window that item k is exposed, we log whether the player eventually clicks on it or not, which is treated as a binary label y ∈ {0, 1}^2. The raw dataset is extremely biased due to the presence of the existing heuristic policy, and it is imbalanced concerning the label and the distribution of exposed item types.

1 https://king.com
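To make this data layout concrete, the following minimal sketch (Python/NumPy; the field and function names are ours and purely illustrative, not from the production pipeline) assembles one such sample from its three ingredients:

import numpy as np

D = 48   # feature dimensionality used later in Section 3
K = 5    # number of items in the running example

def make_sample(features, exposed_item, clicked):
    # features: length-D vector x aggregated over the last N days before time t
    # exposed_item: index k of the item shown by the heuristic policy at time t
    # clicked: True if the player clicked item k within its exposure window
    x = np.asarray(features, dtype=np.float32).reshape(D)
    y = np.array([0.0, 1.0] if clicked else [1.0, 0.0], dtype=np.float32)  # one-hot click label
    return {"x": x, "k": int(exposed_item), "y": y}

sample = make_sample(np.random.rand(D), exposed_item=2, clicked=False)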
Figure 3: The architecture of DFSNet (best viewed in color). In the example shown here, we set K, the number of items, to 5. (As depicted in the figure, the per-item preference predictor DNNs use layer widths 1024-512-256-2, 512-256-128-2, 256-128-64-2, 64-32-16-2, and 64-32-16-2, while the per-item confidence predictor DNNs use 1024-512-256-2, 768-384-192-2, 512-256-128-2, 128-64-32-2, and 96-48-24-2; each branch is preceded by a sample balancer, and the outputs feed a meta ranking module with exploration.)
2.2 The End-To-End Model: DFSNet

In this section, we propose to train a debiasing few-shot network, DFSNet, to perform few-shot in-game recommendation using only the heavily biased and imbalanced dataset. The goal of DFSNet is to rank K items, where the k-th item has a value v_k, for a player (represented by a D-dimensional feature vector x ∈ R^D), in order to maximize the expected click value. As shown in Figure 3, DFSNet consists of three modules: preference predictors, confidence predictors, and meta ranking. The training is conducted in a mini-batch fashion; the input is a matrix X = {x^(m)}_{m=1}^{M} ∈ R^{M×D}, where M is the number of samples in each mini-batch and D is the number of features in each sample. For the sake of conciseness, we use the general terms x and X to denote any sample and mini-batch, respectively.

2.2.1 Preference Predictors. For a player x, the preference predictor module (cf. the red dashed bounding box in Figure 3) predicts the probability ŷ_k that player x will click item k if this item is exposed. During training, the mini-batch X is first divided into K subsets (noted as X_1, ..., X_K), so that the k-th subset X_k ∈ R^{M_k×D} only contains the M_k players that were exposed to the k-th item. As a result, each item k has its own architectural branch, which sequentially propagates X_k through a sample balancer and a preference predictor, and eventually yields the click/non-click probability Ŷ_k ∈ R^{M_k×2}.

Since the number of clicked items usually represents a small fraction of the entire exposed item set, there are far more negative samples (y=[1,0]) than positive ones (y=[0,1]) in X_k. In many recommendation methods such as [16, 20], positive and negative samples are manually balanced by random sampling, and the rich information embodied by negative samples is lost. We propose a minority subsampling technique (cf. sample balancers in Figure 3) to automatically balance X_k during training. We split X_k into two sets X_k^+ and X_k^−, where X_k^+ contains all M_k^+ positive samples, X_k^− contains the M_k^− negative samples, and M_k^+ ≪ M_k^−. We randomly pick (without replacement) max(min(M_k^+, M_k^−), 1) samples from X_k^− and put them in a set X̃_k^−. We construct the balanced mini-batch subset X̃_k by

X̃_k = X_k^+ ∪ X̃_k^−,  where X̃_k ∈ R^{[max(min(M_k^+, M_k^−), 1) + M_k^+] × D}.    (1)

This minority subsampling balancer is conceptually similar to the negative sampling in [22], which enforces each mini-batch to contain only one positive sample; our approach, however, results in a far more balanced mini-batch. Similarly to negative sampling, in minority subsampling, X̃_k must contain at least one sample of the minority class.
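A minimal NumPy sketch of this per-branch balancing step is given below; the function name and array layout are purely illustrative, assuming the one-hot click labels described in Section 2.1:

import numpy as np

def balance_branch(X_k, y_k, rng=np.random.default_rng(0)):
    # Minority subsampling for one item branch, cf. Equation (1):
    # keep all positives plus max(min(M_k_pos, M_k_neg), 1) negatives
    # sampled without replacement.
    pos = np.where(y_k[:, 1] == 1)[0]          # indices of positive (clicked) samples
    neg = np.where(y_k[:, 1] == 0)[0]          # indices of negative samples
    n_neg = max(min(len(pos), len(neg)), 1)    # how many negatives to keep
    neg_kept = rng.choice(neg, size=min(n_neg, len(neg)), replace=False)
    keep = np.concatenate([pos, neg_kept])
    return X_k[keep], y_k[keep]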
The output of the sample balancer in the k-th branch (i.e., X̃_k in Figure 3) is then fed to a preference predictor implemented with a 4-layer Deep Neural Network (DNN) binary classifier. The ELU (Exponential Linear Unit) activation function [9] is applied to all hidden layers except the last one, which is a softmax layer with two neurons. Dropout could be applied to avoid overfitting, yet we choose to empirically scale the first three layers of the k-th DNN proportionally (from a base architecture 32-16-8) to the exposure ratio of the corresponding item: M_k / Σ_{k=1}^{K} M_k. The loss to optimize the preference predictor module, L_p, is formulated as

L_p = (1 / 2K) Σ_{k=1}^{K} [ (1 / M̃_k) Σ_{m=1}^{M̃_k} ‖ −y_k^{(m)} ∗ log ŷ_k^{(m)} ‖_1 ],    (2)

where "∗" represents element-wise multiplication, M̃_k = max(min(M_k^+, M_k^−), 1) + M_k^+ is the number of samples in X̃_k, the notation y_k^{(m)} is the label (one-hot encoded vector) of the m-th sample in X̃_k, and ŷ_k^{(m)} is the predicted probability vector for the same sample.
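The sketch below illustrates one such branch in TensorFlow/Keras. The exact width-scaling rule is not spelled out beyond "proportional to the exposure ratio", so the scale constant here is our assumption, chosen so that a heavily exposed item yields the 1024-512-256-2 branch seen in Figure 3:

import tensorflow as tf

def preference_branch(exposure_ratio, base=(32, 16, 8), scale=160):
    # Three ELU hidden layers scaled from the base architecture 32-16-8 in
    # proportion to the item's exposure ratio M_k / sum_k M_k ('scale' is an
    # illustrative constant, not the exact rule from the paper), followed by
    # a 2-neuron softmax output (non-click / click).
    widths = [max(8, int(w * exposure_ratio * scale)) for w in base]
    layers = [tf.keras.layers.Dense(w, activation="elu") for w in widths]
    layers.append(tf.keras.layers.Dense(2, activation="softmax"))
    return tf.keras.Sequential(layers)

# Example: a branch for an item receiving roughly 20% of all exposures,
# which here comes out as hidden widths 1024-512-256.
branch = preference_branch(exposure_ratio=0.2)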
2.2.2 Confidence Predictors. To explicitly model the bias from the pre-dominant heuristic, we introduce a confidence predictor module (cf. the green dashed bounding box in Figure 3) to DFSNet. The confidence predictors estimate the probability c_k that player x has recently been exposed to item k. Thus, ĉ_k can be treated as an approximation of the confidence we have in the predicted click probability ŷ_k. Similar to the preference predictors, this module also employs K branches (for K items), each of which has a sample balancer and a DNN binary classifier.

The mini-batch input X ∈ R^{M×d} is fed into the sample balancer of each branch indiscriminately. To prevent the confidence predictors from simply memorizing the heuristic rule and losing generalization capability, it is important to remove the features (if any) that are used in the heuristic policy; hence X's second dimension d may be smaller than the original dimension D. The sample balancer in this module first divides X into two subsets X_k ∈ R^{M_k×d} and X_¬k ∈ R^{(M−M_k)×d}, where X_k only contains the M_k players exposed to item k, and X_¬k has the remaining M − M_k samples. Due to the pre-existing heuristic policy, the item exposure was not randomized, making the sizes of X_k and X_¬k imbalanced. To that end, we need a sample balancer for each branch to produce a balanced mini-batch X̄_k using

X̄_k = X_¬k ∪ X′_k ∈ R^{[max(M − M_k, 1) + (M − M_k)] × d},   if M_k ≥ M − M_k
X̄_k = X_k ∪ X′_¬k ∈ R^{[max(M_k, 1) + M_k] × d},             if M_k < M − M_k        (3)

where X′_k and X′_¬k are obtained via minority subsampling (without replacement); specifically, the former contains max(M − M_k, 1) samples randomly selected from X_k, and the latter contains max(M_k, 1) randomly picked samples from X_¬k. M̄_k denotes the number of samples in X̄_k, hence X̄_k ∈ R^{M̄_k×d}.

X̄_k is then fed to a confidence predictor implemented in the same way as in the preference predictors, except that each DNN is scaled proportionally to the factor M̄_k / Σ_{k=1}^{K} M̄_k. The loss L_c to optimize this module has a similar form to Equation (2):

L_c = (1 / 2K) Σ_{k=1}^{K} [ (1 / M̄_k) Σ_{m=1}^{M̄_k} ‖ −c_k^{(m)} ∗ log ĉ_k^{(m)} ‖_1 ],    (4)

where c_k^{(m)} ∈ {0, 1}^2 is the constructed confidence label specifying whether the m-th player/sample in X̄_k actually saw item k or not, and ĉ_k^{(m)} is the predicted confidence probability vector for the same sample. The preference and confidence predictors are jointly optimized with a total loss L = L_p + L_c.
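Restated in code, both losses reduce to a per-branch cross-entropy averaged with the 1/(2K) factor of Equations (2) and (4). The sketch below (TensorFlow; function names are ours) assumes the per-branch targets and predictions are provided as lists of tensors:

import tensorflow as tf

def branch_loss(targets, preds):
    # Mean over the balanced mini-batch of || -y * log(y_hat) ||_1,
    # i.e., a categorical cross-entropy as in Equations (2) and (4).
    ce = tf.reduce_sum(-targets * tf.math.log(preds + 1e-8), axis=-1)
    return tf.reduce_mean(ce)

def total_loss(pref_targets, pref_preds, conf_targets, conf_preds, K):
    # L = Lp + Lc, each averaged over the K item branches with a 1/(2K) factor.
    lp = tf.add_n([branch_loss(y, p) for y, p in zip(pref_targets, pref_preds)]) / (2.0 * K)
    lc = tf.add_n([branch_loss(c, p) for c, p in zip(conf_targets, conf_preds)]) / (2.0 * K)
    return lp + lc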
2.2.3 Meta Ranking. During model serving/prediction, the sample balancers are omitted, meaning that the input X ∈ R^{M×D} (representing M players) is directly fed to the DNNs in all branches, in order to simultaneously generate real-valued preference (Ŷ ∈ [0, 1]^{M×K×2}) and confidence (Ĉ ∈ [0, 1]^{M×K×2}) predictions. The second values in the last dimension of Ŷ are the click probabilities, while those of Ĉ are the confidence levels expressed as probabilities. To simplify the discussion that follows, we will use ŷ_k and ĉ_k to denote, respectively, the predicted click probability and confidence level of item k for an individual player x. The meta ranking module (cf. the right-most box in Figure 3) ranks items by calculating a propensity score R_k for each item using three factors: ŷ_k, ĉ_k, and v_k; the term v_k is the value of item k, which is usually predefined. We propose a piece-wise formula for computing R_k:

R_k = v_k / min(v),                 if (ĉ_k ≥ 1/2) ∧ (ŷ_k ≥ 1/2)
R_k = ĉ_k · ŷ_k · v_k / max(v),     otherwise                                        (5)

where the functions min(v) and max(v) respectively return the minimum and maximum element of the vector v = [v_1, ..., v_K]. Generally speaking, R_k is obtained by calibrating ŷ_k with ĉ_k and v_k, so that random exploration data is not mandatory (at least initially). To the best of our knowledge, only [17] discussed the possibility of removing item position bias using an adversarial network, yet our approach manages to deal with a much stronger item exposure bias using a more explainable strategy; and explainability is valued highly in industrial environments [11]. If v_k is not available (Figures 1a and 1b), we can adapt Equation (5) to

R_k = ŷ_k,            if (ĉ_k ≥ 1/2) ∧ (ŷ_k ≥ 1/2)
R_k = ĉ_k · ŷ_k,      otherwise                                                      (6)

We can conveniently assume that the K items are already sorted by their values v, hence the propensity scores R = [R_1, ..., R_K] are also sorted accordingly. An overly drastic change of item exposure (e.g., a player who used to see item 1 according to the heuristic suddenly gets item K from a newly deployed recommender system) may undermine the player experience and the game ecosystem. To avoid that situation, it is good practice to enforce a heuristic deviation threshold (noted as k_s ∈ {1, ..., K − 1}) in the online production environment. Specifically, we mask R_k with

R̃_k = 0,      if |k_h − k| > k_s
R̃_k = R_k,    otherwise                                                              (7)

where k_h is the item from the pre-existing heuristic policy. With R̃ = [R̃_1, ..., R̃_K], both one-shot and few-shot in-game recommendation are possible. When recommending items based on R̃, we can sometimes choose to apply ϵ-greedy exploration to slowly accumulate more diversified data for follow-up model iterations.
3 EXPERIMENTATION AND EVALUATION

We apply DFSNet to a real-time item recommendation scenario for the CCSS game. There is a total of five items (K=5) in this scenario, yet only one item k can be shown on the mobile screen when the player triggers the exposure event. The item k has a value v_k. Items are sorted by value in ascending order, i.e., v_1 < v_2 < v_3 < v_4 < v_5. If a player clicks on the exposed item k, a value v_k is added to the game ecosystem; and we choose to maximize the value of the clicked item. The details of the concrete use case and items are considered sensitive proprietary information and are therefore anonymized in this paper.

As illustrated in Figure 4, the raw dataset is collected (cf. Section 2.1) using a Flink2-based stateful streaming platform [6]. The collected dataset contains approximately 22 million samples, each of which has D = 48 features. We apply different transformations (e.g., min-max, z-score, and logarithmic) to numerical features and perform either one-hot encoding or embedding of categorical features. The dataset pre-processing and model development are carried out on a machine learning platform developed by King.

2 https://flink.apache.org
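For illustration, these per-feature transformations could be sketched as follows (column names and the mapping of transforms to columns are hypothetical):

import numpy as np

def min_max(col):            # scale a numerical column to [0, 1]
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo + 1e-9)

def z_score(col):            # standardize a numerical column
    return (col - col.mean()) / (col.std() + 1e-9)

def log_transform(col):      # compress heavy-tailed counters
    return np.log1p(col)

def one_hot(col, num_categories):   # encode a categorical column
    return np.eye(num_categories)[col]

# Hypothetical usage on two columns of the raw feature table
rounds_played = np.array([3, 120, 0, 45], dtype=float)
device_type = np.array([0, 2, 1, 2])
features = np.column_stack([log_transform(rounds_played), one_hot(device_type, 3)])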
Figure 4: High-level system topology of offline model development and online model serving.
We will present both offline (training and evaluation) and online (serving and monitoring) evaluation of DFSNet in the following sections. DFSNet is implemented in Tensorflow3; the preference and confidence DNNs are scaled as depicted in Figure 3. The training is carried out with the Adam optimizer [15], using 70,000 steps and a mini-batch size of 2,048. The learning rate is initialized to 5 × 10^−3, and then it exponentially decays to 2 × 10^−6. During serving, we set k_s = 2 in Equation (7) to obtain one single item to recommend.

3 https://www.tensorflow.org
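A sketch of this optimization setup is shown below; the paper states only the initial and final learning rates, so the single exponential-decay schedule over the full 70,000 steps is our assumption:

import tensorflow as tf

TOTAL_STEPS = 70_000
INITIAL_LR, FINAL_LR = 5e-3, 2e-6

# Exponential decay from 5e-3 down to roughly 2e-6 over the full training run
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=INITIAL_LR,
    decay_steps=TOTAL_STEPS,
    decay_rate=FINAL_LR / INITIAL_LR,   # one full decay across 70k steps
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)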
3.1 Offline Performance: Model Training and Validation

To perform offline model evaluation, we create a validation dataset (noted as U = {x^(u)}_{u=1}^{U}) by randomly selecting 1% of the data from the raw dataset (thus U ≈ 0.22 million), and use the rest for training.

3.1.1 Generalization evolvement during training. We evaluate the performance of the current model on the validation dataset during the training. Since the datasets are highly imbalanced, accuracy is not an informative metric to monitor during training. We also find that recall and precision compete with each other (showing no clear trend) during the training, and are hence not ideal for monitoring the training performance. AUC-ROC (Area Under the Curve of the Receiver Operating Characteristic), on the other hand, is a stable metric that reliably tells how well the model is capable of distinguishing between classes; therefore, the evolution of per-item AUC-ROCs (see Figure 5) indicates how the generalization ability of the model improves during the training process. At the end of the training, we also measure the recall and precision for each item, which are visualized as red bars in Figure 12.

3.1.2 Policy change quantization: heuristic vs. DFSNet. We use the trained DFSNet to obtain predictions on the validation dataset. We first measure the overall change of the item exposure distribution. The results are reported in Figure 6a. In our experiment, we observed no significant change in item allocation for players due to the strong confidence constraint imposed, yet there is a slight shift towards the higher-valued items. The ratio of players that see a different item (than with the heuristic) is about 7.4%. To decompose the policy change, we illustrate, in Figure 6b, a Policy Transition Matrix (PTM), where each cell at position (i, j) indicates the ratio of players who were supposed to get item j according to the heuristic policy but are now exposed to item i according to DFSNet. It can be seen that the diagonal holds the majority of the unchanged exposures, and each row largely follows a truncated normal distribution.
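A small sketch of how such a PTM can be computed from the two per-player item assignments (the function name and toy inputs are ours):

import numpy as np

def policy_transition_matrix(heuristic_items, dfsnet_items, K=5):
    # Cell (i, j): ratio of players who would get item j under the heuristic
    # but are exposed to item i by DFSNet; each row is normalized to sum to 1.
    counts = np.zeros((K, K))
    for j, i in zip(heuristic_items, dfsnet_items):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)

# Hypothetical usage with per-player item indices from both policies
ptm = policy_transition_matrix(np.array([0, 0, 1, 2]), np.array([0, 1, 1, 2]))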
3.1.3 Distribution of preference and confidence predictions. For each sample in the validation dataset, DFSNet produces ten probabilities: five click probabilities (ŷ_1 to ŷ_5) and five confidence probabilities (ĉ_1 to ĉ_5). Figure 7 visualizes ŷ_k and ĉ_k jointly to answer four questions:

(1) Does ŷ_k reflect the low click ratio of item k? The five red area plots on the diagonal are the distributions of ŷ_k, all of which show that clicking tends to be a rare event.

(2) Does ĉ_k match the exposure ratio of item k? The five green bar plots on the diagonal represent the distributions of ĉ_k; the majority of exposures come from item 1, which coincides with the heuristic item exposure distribution in Figure 6a.

(3) Does ŷ_k show general item preference? The lower triangular portion has pair-wise scatter plots of click probabilities. Each data point in the plot for items i and j has a coordinate of (ŷ_i, ŷ_j); thus, if the point is below the line ŷ_i = ŷ_j, the corresponding player prefers item i over j, and vice versa. To examine the general trend, we fit linear models (red straight lines through the origin) for the pair-wise plots. We observe that, on average, players prefer items with lower values.

(4) Can DFSNet be confident about multiple items for the same player? The upper triangular portion in Figure 7 contains pair-wise scatter plots of confidence probabilities ĉ_k. Every point in the plot for items i and j is located at (ĉ_i, ĉ_j). Intuitively, implied by Equation (5), the points (representing players) in the green shaded areas are likely eligible for more than one item.

3.1.4 Best-effort estimation of recall, precision, and uplifts. On the offline validation dataset U ∈ R^{U×D}, it is impossible to measure the "quality" of a recommendation that is different from what was actually exposed; hence, a sub-optimal solution is to create a subset (from U) containing only the players for whom both DFSNet and the heuristic policy recommended the same item. We use U′ ∈ R^{U′×D} (U′ < U) to denote that subset. On that subset, we measure per-item recall and precision for the preference predictors (cf. the red bars in Figure 12).
Figure 5: Training performance of confidence predictors (a-f) and preference predictors (g-l). (a-e), (g-k): the AUC-ROC measured on the validation dataset for each item during training; (f), (l): the item ROCs on the validation dataset at the end of the training.

Figure 6: Comparison of the heuristic and DFSNet policies on the validation dataset: (6a) overall item exposure distribution and (6b) policy transition matrix (PTM). Each row in the PTM adds up to 1.0.
To provide uplift baselines of average click rate (#_clicked_items / #_items) and average click value (total_value_of_clicked_items / #_clicked_items), we calculate both metrics for both the heuristic and DFSNet policies. The results are presented in Table 1. The offline uplifts will then be compared with the ones obtained during online model serving (cf. Section 3.2.3).

3.2 Online Performance: Real-Time Serving and Monitoring

After the DFSNet model is trained and validated in an offline environment, it is deployed in a Tensorflow Serving4 cluster. As illustrated in Figure 4, a prediction client (sharing the entire feature collection logic described in Section 2.1) is also deployed on the streaming cluster. To validate the online performance of the DFSNet model, we run an A/B test on a small fraction of players on CCSS. For each game player in the test group, the prediction client makes a request to the DFSNet prediction service (in real-time) as soon as any pre-defined triggering event emerges. In the life cycle of a real-time recommendation system, it is often required to iterate on the model serving periodically (cf. Figure 8 for an example of two serving iterations) to incorporate bug fixes or new models trained on more recent data.

During online serving, we track several metrics (aggregated into temporal windows of 5 minutes) to monitor the key system performance; some examples include model response time, model exceptions, and model raw output distribution. The definition of those system metrics remains the same for different recommendation models. These metrics are indicators of the system health, and therefore, they play a critical role in the validity of the model. To monitor model performance, we log all features, predictions, and labels in a BigQuery5 database, and visualize them in a dashboard that is updated on an hourly basis.

4 https://www.tensorflow.org/tfx/guide/serving
5 https://cloud.google.com/bigquery
                   Figure 7: The visualization of preference (red plots) and confidence (green plots) predictions.

Table 1: The estimated click rate and value obtained from the validation dataset. For the DFSNet policy, if item k (with a predicted click probability ŷ_k) is ranked highest and ŷ_k > 0.5, we assume the player will click the recommended item k, generating a value v_k.

Metrics (obtained on validation dataset)    Heuristic Policy    DFSNet Policy    Uplift (%)
Average click rate                          0.1022              0.1312           28%
Average click value                         1.52                1.92             26%
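The estimates in Table 1 can be reproduced with the following sketch (NumPy; array names are ours), which encodes the stated assumption that the top-ranked item is clicked whenever its predicted click probability exceeds 0.5:

import numpy as np

def offline_estimates(y_hat, ranked_first, values, clicked_heuristic, value_heuristic):
    # y_hat            : (U, K) predicted click probabilities on the validation set
    # ranked_first     : (U,) index of the top-ranked item per player under DFSNet
    # values           : (K,) item values
    # clicked_heuristic: (U,) booleans, observed clicks under the heuristic policy
    # value_heuristic  : (U,) observed click values under the heuristic (0 if no click)
    u = np.arange(len(ranked_first))
    assumed_click = y_hat[u, ranked_first] > 0.5          # assumption from Table 1
    dfsnet_rate = assumed_click.mean()
    dfsnet_value = values[ranked_first][assumed_click].mean() if assumed_click.any() else 0.0
    heuristic_rate = np.mean(clicked_heuristic)
    heuristic_value = value_heuristic[clicked_heuristic].mean() if clicked_heuristic.any() else 0.0
    return {"click_rate": (heuristic_rate, dfsnet_rate),
            "click_value": (heuristic_value, dfsnet_value)}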


We will hereafter emphasize our online evaluation on several key perspectives, all of which are adapted from the monitoring dashboard.

3.2.1 Heuristic deviation trend. The foremost questions to answer about model performance are twofold: (1) what the scale of the model impact is, and (2) how this impact evolves along the timeline. To answer these two questions, we illustrate the overall heuristic deviation trend of two adjacent model serving iterations in Figure 8, where the red curve shows the ratio of players (in the DFSNet A/B test group) for which DFSNet and the heuristic policy recommended different items. As reported in Section 3.1.2, this ratio is approximately 7.4% (represented by a straight green line) when measured on the offline validation dataset. So, the expectation is that the ratio of impact should reach around 7.4%; this trend can be clearly seen in each model serving iteration. However, some input features need individual players to respond to certain game components, which takes about three days in the use case discussed here; and items are served using the heuristic policy to players that still have incomplete features. As a result, for each model serving iteration, the ratio always starts from a fairly low point before reaching 7.4%. Furthermore, between two subsequent iterations, the ratio drops for three days (in this use case) while features are rebuilt, before picking up the ascending trend again. We believe that daily monitoring of the heuristic deviation trend helps track the scale of the model impact effectively.
Figure 8: The ratio of players that see different items from the heuristic policy: a span of 27 days containing two model serving iterations.
items are served using the heuristic policy to players that still have                         expectation. In practice, it is acceptable to have a few red
incomplete features. As a result, for each model serving iteration,                            boxes as long as most of the densely populated cells satisfy
the ratio always starts from a fairly low point before reaching 7.4%.                          the expectation.
Furthermore, between two subsequent iterations, the ratio drops,                           (d) The PTM in Figure 9d principally serves the same purpose
in this use case, for three days while rebuilding features before                              as Figure 9c except that it computes the average click value
picking up the ascending trend again. We believe that the daily                                of item i instead.
monitoring of the heuristic deviation trend helps tracking the scale                       Based on the PTM analysis, we expect to see an improvement in
of the model impact effectively.                                                        the user engagement with the examined game feature compared
3.2.2 Player Transition Matrix (PTM). To gain better insight into the underlying changes contributing to the overall model impact, we break down the impact analysis further using PTMs. Here, a PTM is a 2D matrix that assigns players to grid cells according to how their experience changed from the default heuristic policy to the model policy. Figure 9 summarizes the results over a period of 14 days starting from Day11 (cf. Figure 8) and illustrates four different PTMs (a-d) that enable model impact decomposition from four different perspectives:
    (a) Grid cell (i, j) denotes the number of players that are now exposed to item i but would have originally got item j. The gray-scale background of Figure 9a is also used in (c) and (d).
    (b) Calculated by dividing each number in (a) by the sum of the corresponding row. It should have the highest ratio values on the diagonal, as obtained in the offline evaluation and reported in Figure 6b. Because the presence of incomplete features leads to a policy fallback (to the heuristic, as mentioned in Section 3.2.1), the results presented in Figure 9b are more conservative than those shown in Figure 6b.
    (c) To inspect how the moved players impact the click probability on the PTM, we calculate the click percentage (of item i) for the player cohort in each grid cell, and then overlay the percentage values on top of Figure 9a's gray-scale background, resulting in Figure 9c. The blue values in the diagonal cells are the click percentages for the control group. Each off-diagonal cell (i, j) with i ≠ j contains the players that are moved from cell (j, j) to cell (i, i). Intuitively, we expect the model to guarantee that the click percentage in cell (i, j), i ≠ j, is larger than the percentage values in either cell (i, i) or (j, j); we use red boxes to highlight the cells that fail to satisfy that expectation.
    (d) Analogous to (c), Figure 9d overlays the average click value of item i for the players who would have seen item j according to the heuristic, which allows comparing the monetary impact of the model policy to using the heuristic solution.
A minimal construction sketch of these four matrices follows the caption of Figure 9.
   Further analysis of the PTM can help us better understand user behaviors. In our analysis, we used an eight-dimensional space to describe player behavior. Figure 10 shows different user behavioral patterns (in the form of radar charts) on top of the grayscale background from Figure 9a. The KPI calculation and the actual values are considered sensitive proprietary data and are therefore removed from the charts.

3.2.3 Uplifts of click ratio, count, and value. The previous sections presented a drill-down process of analyzing the model behavior in comparison with the heuristic policy. We now zoom out and compare the item click dynamics with the control group. We focus on the accumulated 14-day uplift of three metrics: click count, click ratio, and click value (these metrics are defined in Section 3.1.4). The online uplift is computed by subtracting the metrics (normalized by the population size of the A/B test groups) of the control group from those of the DFSNet group. Hence, uplift can take both positive and negative values. All uplift values are considered sensitive proprietary data and are therefore scaled.
   Figure 11a shows that the DFSNet group is losing click counts on items 1 and 4 while gaining more click counts on the other items; the players moved away from those buckets are therefore mostly item clickers, and they bring more absolute click counts to items 2, 3, and 5. However, the click ratio of item 5 is reversed in Figure 11b, which is a consequence of the lower click ratio in the player cohort moved to item 5. Figure 11c shows the uplift of accumulated click value for each item. We observe that the loss of click value (from items 1 and 4) is compensated by the increased click value of items 2, 3, and 5, leading to a net positive value uplift (approximately +0.71% over the control group). As a result, the offline uplift estimations (Table 1) are overly optimistic compared to the uplift measured online.
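   For concreteness, the per-item uplift of an accumulated metric (click count or click value) could be computed as in the following sketch; the function and the dictionary-based aggregates are illustrative assumptions rather than our production pipeline. The click ratio, being already a ratio within each group, would be differenced directly without population normalization.

    from typing import Dict

    def per_item_uplift(dfsnet_metric: Dict[int, float],
                        control_metric: Dict[int, float],
                        dfsnet_population: int,
                        control_population: int) -> Dict[int, float]:
        """Uplift of one accumulated metric (e.g., click count or click
        value) per item: the DFSNet group's metric minus the control
        group's, each normalized by its A/B test population size.
        Positive values mean the DFSNet group gained on that item."""
        items = set(dfsnet_metric) | set(control_metric)
        return {
            item: dfsnet_metric.get(item, 0.0) / dfsnet_population
                  - control_metric.get(item, 0.0) / control_population
            for item in items
        }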




(a) Number of players “moved” from j to i.    (b) Ratio version of (a): each row sums to 1.

(c) Click percentage of item i for players who should see item j according to the heuristic.    (d) Average click value of item i for players who should see item j according to the heuristic.

Figure 9: The player transition matrices: each grid cell at coordinate (i, j) indicates (9a-9b) the volume, (9c) the average click ratio, and (9d) the average click value for players who would have been exposed to item j according to the heuristic policy and got recommended item i from DFSNet. The blue numbers along the diagonal in 9c and 9d denote the corresponding statistics obtained from the control group in the A/B test.
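   To make the construction of these four matrices concrete, the sketch below shows one way they could be assembled from the exposure logs of the DFSNet group; the log schema (heuristic item, model item, click flag, click value), the toy rows, and all names are illustrative assumptions, not the exact production implementation.

    import numpy as np

    # Illustrative per-exposure log of the DFSNet group (dummy rows):
    # (item the heuristic would have shown, item the model showed,
    #  whether the player clicked, value attributed to the click).
    logs = [
        (0, 0, 1, 0.9),
        (0, 2, 0, 0.0),
        (3, 1, 1, 1.4),
    ]
    n_items = 5  # number of exposable item types

    counts = np.zeros((n_items, n_items))  # Figure 9a: players "moved" from j to i
    clicks = np.zeros((n_items, n_items))  # numerator for Figure 9c
    values = np.zeros((n_items, n_items))  # numerator for Figure 9d

    for heur_item, model_item, clicked, click_value in logs:
        # Cell (i, j): players exposed to item i by DFSNet who would have
        # received item j from the heuristic policy.
        counts[model_item, heur_item] += 1
        clicks[model_item, heur_item] += clicked
        values[model_item, heur_item] += click_value

    with np.errstate(divide="ignore", invalid="ignore"):
        ratios = np.nan_to_num(counts / counts.sum(axis=1, keepdims=True))  # Figure 9b
        click_pct = np.nan_to_num(clicks / counts)   # Figure 9c: cohort click percentage
        avg_value = np.nan_to_num(values / counts)   # Figure 9d: cohort average click value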


3.2.4 Iterating on the DFSNet model. The dataset used to train DFSNet is highly biased due to the pre-existing heuristic rules. Our approach achieves debiasing by incorporating confidence predictors, resulting in a mild impact of less than 8% (cf. Figure 8). Nonetheless, that impact continuously changes the players' experiences, which gradually shifts the input feature distributions. This creates a direct feedback loop that gradually compromises the generalization and discriminative capability of the model being served. It is a form of analysis debt [18], in which it becomes increasingly difficult to predict the behavior of a given model before it is released. Iterating the model periodically using more recently collected data can reduce the intensity of the feedback loop. However, we need a metric to determine when to train a new model. Accuracy is not an option, since the logged data is heavily imbalanced (click events are rare) and we care more about correctly predicting the click events. In practice, AUC-ROC (cf. Section 3.1) and response distribution charts [4] can also be used to monitor the feedback loop, yet they are not as sensitive as precision and recall. We propose to monitor the precision and recall of the preference predictors (i.e., the green bars in Figure 12) to identify the "right" moment for model iteration; a minimal sketch of such a monitoring rule follows the caption of Figure 12.
   In Figure 12, the red bars represent the precision and recall estimated using a subset of the validation dataset (as explained in Section 3.1.4), while the green bars in Figures 12a and 12b are, respectively, the precision and recall calculated 14 days after the model got deployed (a snapshot on Day24 in Figure 8). We observe that online precision and recall initially reach much higher values than the offline reference; hence, we argue that the offline evaluation tends to underestimate the true values of precision and recall. Figures 12c and 12d reflect the situation four days later. The trend is clear: in four days, both precision and recall have declined significantly; when the majority of green bars fall below the red bars, it is probably time to retrain or fine-tune DFSNet using fresher data.

4 CONCLUSION AND PERSPECTIVES

In-game recommendation aims to provide more relevant items to each player. In-game recommendation use cases usually allow exposing only a few items at a time; thus, a change in the choice of items can have a large impact on game dynamics, leading to a short feedback loop. In addition, player preferences change quickly as the game dynamics and player context evolve. As a result, the model gets outdated sooner in real-time prediction. In-game item exposures are mostly dominated by hand-crafted heuristics, which heavily bias the data, and randomized exploration to train an unbiased recommendation model is usually not favored by stakeholders. We propose DFSNet, which enables training an unbiased few-shot recommender using only the biased and imbalanced data.
   During training, AUC-ROC is a stable indicator of the model's generalization ability. We demonstrate several ways to estimate the model performance offline on a validation dataset. We also evaluate the online DFSNet performance in an A/B test. We start by monitoring the overall model impact by looking at the heuristic deviation trend.




Figure 10: The overlay of player behavioral features (in an eight-dimensional space visualized as radar charts) over the player transition matrix; each player dimension is calculated over a period of the past N days and then rescaled to the range [0, 1].
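   The per-dimension rescaling mentioned in the caption could, for instance, be a plain min-max normalization over the player population; the snippet below is a sketch under that assumption, not necessarily the exact transformation used.

    import numpy as np

    def rescale_behavior_features(features: np.ndarray) -> np.ndarray:
        """Min-max rescale each behavioral dimension (column) of a
        players-by-dimensions matrix to [0, 1] for the radar charts."""
        lo = features.min(axis=0, keepdims=True)
        hi = features.max(axis=0, keepdims=True)
        span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant dimensions
        return (features - lo) / span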




            (a) Uplift of click counts.                              (b) Uplift of click ratio.                      (c) Uplift of click value.

Figure 11: The accumulated scaled uplift (DFSNet over the control A/B test group) of (11a) click count, (11b) click ratio, and (11c) click value per exposed item type.


Then, we further decompose the model impact using PTMs. We carried out data analysis to understand user behaviors and discern the key factors causing players to be exposed to a different item than the heuristic recommendation. This work proposes a solution to the problem of biased and imbalanced data in the domain of in-game recommender systems. We suggest offline and proxy metrics as a way to estimate the model's online performance. We discuss and showcase the challenges of an online solution in an A/B test. The comparison between the control and DFSNet test groups shows a net +0.71% uplift of click value, which is less optimistic than the best-effort offline estimation.




(a) Precision of click prediction up to Day24.    (b) Recall of click prediction up to Day24.




(c) Precision of click prediction up to Day28.    (d) Recall of click prediction up to Day28.

  Figure 12: The online precision and recall of DFSNet preference predictions compared to the offline best-effort estimates.
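   Building on Section 3.2.4, the comparison visualized in Figure 12 can be distilled into a simple retraining trigger. The function below is a minimal sketch of such a rule, assuming per-item online and offline precision/recall estimates are available; the threshold and all names are illustrative assumptions, not the exact criterion we use.

    from typing import Dict

    def should_retrain(online: Dict[int, Dict[str, float]],
                       offline: Dict[int, Dict[str, float]],
                       tolerated_fraction: float = 0.5) -> bool:
        """Signal retraining when, for more than `tolerated_fraction` of the
        items, both online precision and recall (green bars in Figure 12)
        have fallen below the offline best-effort estimates (red bars)."""
        items = online.keys() & offline.keys()
        degraded = [
            item for item in items
            if online[item]["precision"] < offline[item]["precision"]
            and online[item]["recall"] < offline[item]["recall"]
        ]
        return len(degraded) > tolerated_fraction * len(items)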


We show that continuous comparison of offline and online precision/recall can help determine the appropriate time to retrain the model. Further analysis is required before putting the proposed solution live. In the scenario presented in this paper, we chose the click-through rate as one of the evaluation metrics, as it is widely adopted in e-commerce. However, this metric might not be a good proxy for the business metrics of an in-game recommender system. Future work will explore the choice of metrics that constitute a better proxy of the model's online performance.
   In addition, future work includes (1) designing long-term labels that better approximate the business targets, (2) explicitly modeling the interactions between different in-game features to eliminate the implicit feedback loop, and (3) replacing model iteration with online reinforcement learning approaches.

REFERENCES
 [1] Kati Alha, Elina Koskinen, Janne Paavilainen, Juho Hamari, and Jani Kinnunen. 2014. Free-to-play games: Professionals' perspectives. In DiGRA Nordic: Proceedings of the 2014 International DiGRA Nordic Conference. Proceedings of Nordic DiGRA 11, 1–14.
 [2] S. M. Anwar, T. Shahzad, Z. Sattar, R. Khan, and M. Majid. 2017. A game recommender system using collaborative filtering (GAMBIT). In 2017 14th International Bhurban Conference on Applied Sciences and Technology (IBCAST). 328–332.
 [3] Vladimir Araujo, Felipe Rios, and Denis Parra. 2019. Data mining for item recommendation in MOBA games. In Proceedings of the 13th ACM Conference on Recommender Systems. Association for Computing Machinery, New York, NY, USA, 393–397.
 [4] Lucas Bernardi, Themistoklis Mavridis, and Pablo Estevez. 2019. 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.Com. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD '19). Association for Computing Machinery, New York, NY, USA, 1743–1751. https://doi.org/10.1145/3292500.3330744
 [5] P. Bertens, A. Guitart, P. P. Chen, and A. Perianez. 2018. A Machine-Learning Item Recommendation System for Video Games. In 2018 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, Maastricht, The Netherlands, 1–4.
 [6] Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, and Kostas Tzoumas. 2017. State management in Apache Flink®: consistent stateful distributed stream processing. Proceedings of the VLDB Endowment 10, 12 (2017), 1718–1729.
 [7] Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. 2019. Top-K Off-Policy Correction for a REINFORCE Recommender System. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM '19). Association for Computing Machinery, New York, NY, USA, 456–464. https://doi.org/10.1145/3289600.3290999
 [8] Zhengxing Chen, Christopher Amato, Truong-Huy D. Nguyen, Seth Cooper, Yizhou Sun, and Magy Seif El-Nasr. 2018. Q-DeckRec: A fast deck recommendation system for collectible card games. In 2018 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, Maastricht, The Netherlands, 1–8.
 [9] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289 [cs.LG]
[10] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys '16). Association for Computing Machinery, New York, NY, USA, 191–198. https://doi.org/10.1145/2959100.2959190
[11] Krishna Gade, Sahin Cem Geyik, Krishnaram Kenthapadi, Varun Mithal, and Ankur Taly. 2019. Explainable AI in industry. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, New York, NY, USA, 3203–3204.
[12] Mihajlo Grbovic and Haibin Cheng. 2018. Real-time personalization using embeddings for search ranking at Airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, New York, NY, USA, 311–320.
[13] Rama Hannula, Aapo Nikkilä, and Kostas Stefanidis. 2019. GameRecs: Video Games Group Recommendations. In Welzer T. et al. (Eds.), New Trends in Databases and Information Systems, ADBIS. Springer, Cham, Switzerland.
[14] JaeWon Kim, JeongA Wi, SooJin Jang, and YoungBin Kim. 2020. Sequential Recommendations on Board-Game Platforms. Symmetry 12, 2 (2020), 210.
[15] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, Yoshua Bengio and Yann LeCun (Eds.). San Diego, CA, USA, 13 pages.
[16] Jianxun Lian, Fuzheng Zhang, Xing Xie, and Guangzhong Sun. 2018. Towards better representation learning for personalized news recommendation: a multi-channel deep fusion approach. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, 3805–3811. https://doi.org/10.24963/ijcai.2018/529
[17] John Moore, Joel Pfeiffer, Kai Wei, Rishabh Iyer, Denis Charles, Ran Gilad-Bachrach, Levi Boyles, and Eren Manavoglu. 2018. Modeling and Simultaneously Removing Bias via Adversarial Neural Networks. arXiv:1804.06909 [cs.LG]
[18] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden Technical Debt in Machine Learning Systems. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. MIT Press, Cambridge, MA, USA, 2503–2511.
[19] Rafet Sifa, Raheel Yawar, Rajkumar Ramamurthy, Christian Bauckhage, and Kristian Kersting. 2020. Matrix- and Tensor Factorization for Game Content Recommendation. KI - Künstliche Intelligenz 34, 1 (2020), 57–67.
[20] Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018. DKN: Deep knowledge-aware network for news recommendation. In Proceedings of the 2018 World Wide Web Conference. ACM, New York, NY, USA, 1835–1844.
[21] Robert Williams. 2020. Mobile games sparked 60% of 2019 global game revenue, study finds. Mobile Marketer. Retrieved January 2, 2020 from https://www.mobilemarketer.com/news/mobile-games-sparked-60-of-2019-global-game-revenue-study-finds/569658/
[22] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. NPA: Neural news recommendation with personalized attention. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, New York, NY, USA, 2576–2584.
[23] Hsin-Chang Yang and Zi-Rui Huang. 2019. Mining personality traits from social messages for game recommender systems. Knowledge-Based Systems 165 (2019), 157–168.
[24] Hsin-Chang Yang, Cathy S. Lin, Zi-Rui Huang, and Tsung-Hsing Tsai. 2017. Text mining on player personality for game recommendation. In Proceedings of the 4th Multidisciplinary International Social Networks Conference. Association for Computing Machinery, New York, NY, USA, 1–6.
[25] Chang Zhou, Jinze Bai, Junshuai Song, Xiaofei Liu, Zhengchao Zhao, Xiusi Chen, and Jun Gao. 2018. ATRank: An attention-based user behavior modeling framework for recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence. AAAI Press, New Orleans, Louisiana, USA, 4564–4571.