<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Churn Prediction for Game Industry Based on Cohort Classi cation Ensemble</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Evgenii Tsymbalov</string-name>
          <email>etsymbalov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Webgames</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>94</fpage>
      <lpage>100</lpage>
      <abstract>
        <p>In this paper, we present a cohort-based classi cation approach to the churn prediction for social on-line games. The original metric is proposed and tested on real data showing a good increase in revenue by churn preventing. The core of the approach contains such components as tree-based ensemble classi ers and threshold optimization by decision boundary.</p>
      </abstract>
      <kwd-group>
        <kwd>Churn prediction</kwd>
        <kwd>ensemble classi cation</kwd>
        <kwd>cohort-based prediction</kwd>
        <kwd>on-line games</kwd>
        <kwd>game analytics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The churn prediction is a real problem, which can be found in businesses that deal
with a permanent stream of customers using subscription services: banking [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ],
telecommunication [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and entertainment industries, with increasing popularity
of game analytics (e.g. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). The focus of business development shifts from the
attraction of new customers to retention of the old ones. Therefore modeling
users' out ow can be used to plan company's tactics and strategies. However,
it is often important to not simply know the out ow indicators on a macro
level, but to predict the churn probability for every customer to use the spot
interventions.
      </p>
      <p>The de nition of churn and the churners varies depending on a speci c
problem. Here we de ne the churners as users who were absent for 30 days and more.
Moreover, in this research we are interested in users who were absent for at least
three days and made at least one transaction. Such restrictions are motivated by
the low revenue of the late returners and recommendations from social platforms,
e.g., Facebook does not recommend to send out noti cations to the players with
more than 28 days of absence.</p>
      <p>In this paper, we propose a cohort-based classi cation approach to the churn
prediction for social on-line games. Our data processing pipeline includes feature
engineering and selection along with optimization of speci c metrics. Thus, we
introduce a meta-metric, which penalizes undesired outcomes of the cohort test
procedure; it is designed to re ect real-life experience of using the prediction
models. All the steps of the model training pipeline include optimization of
classi ers; the nal step optimizes the whole ensemble on meta-metric values.</p>
      <p>In Section 2, we introduce our cohort-based ensemble method for churn
prediction. In Section 3, we provide the reader with the results of experimental
evaluation. Section 4 concludes the paper.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Method</title>
      <p>In a few words, the churn prediction problem can be stated as follows: given the
data about users, who recently stopped playing, predict whether a particular
user will abandon the game or not. This information is used for sending
appto-user noti cations to those users, who are likely to stop playing our game at
all. These noti cations also o er some in-game goods to the user, usually some
in-game currency, rare valuable items or discounts.
2.1</p>
      <sec id="sec-2-1">
        <title>Meta-metric</title>
        <p>This problem is rather di erent from the standard data mining classi cation
problem due to several evaluation requirements, which result in so-called \cohort
test". It is de ned as follows (see Fig. 1 for short description):
1. We take the cohort of players who had been absent for 3 days by the given
date D.
2. A model, that solves the classi cation problem for a given day of absence,
predicts for every player, whether or not he or she becomes a churner.
3. Those, who were predicted as churners (both true and false churners), are
eliminated from the next steps.
4. We take the cohort of players who had been absent have been absent for 4
days by the given date (D + 1) and had not been eliminated in the previous
part.
5. Steps 2, 3, 4 are repeated until the date (D + 29) or no more users left in
cohort.
6. We evaluate the meta-metric, using the information about successful and
unsuccessful predictions at each step as well as users left after Step 5 (if
there are any).</p>
        <p>Final challenges are connected with the result of the cohort test and its
in uence on the game balance:
{ We do not want to send prizes (bonuses) to people who are likely to return
by themselves, because it may ruin the game balance.
{ We want the players to be returned as soon as possible. The probability of
the player's future transactions as well as the average revenue dramatically
decreases with the growing number of days the player is absent for.
{ We assume that we should not send noti cations to the same player twice.</p>
        <p>To answer these requirements, the cohort-based meta-metric is introduced:
M M =</p>
        <p>X
day=4:::26
( day 4T Pday</p>
        <p>F Pday)
(T P27 + F P27); 0 &lt;
1
(1)</p>
        <p>This metric can be interpreted as a weighted number of returned players,
with the reward for early return and the penalty for type I error (marking a user
who will return as a churner). Note that di erent goods o ered to a user with
noti cations, as well as various social platforms and projects require di erent
parameters for the meta-metric.</p>
        <p>The question of determining coe cients ; ; for a given project should be
discussed at both levels, analytical and managerial. While could be estimated
by approximating the probability of a player being a payer after he or she returns,
and could be estimated by experts (like we do in Webgames now) or based on
the data from the previous experiments (which is unavailable for this research).
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Problem statement</title>
        <p>Let CdDate = (P1; : : : ; PNdate ) be data of D-th day cohort of players on date date,
where D = 4 : : : 26, Pj 2 RjF j D are all player's features up to day (D + date),
F is the set of player's features at a particular day.</p>
        <p>For every user in a cohort, our algorithm A makes a decision S: on which
day D = 4 : : : 26 send out the noti cation (or not to send, then D = 27):
A : (Cdjate)j=4:::26 7 ! S = S(D1; : : : ; DN );
(2)
(3)
Our goal is to nd the decision, which maximizes the meta-metric (1):
S = arg max M M ((Cdjate)j=4:::26; S0)</p>
        <p>S0
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Ensemble of classi ers</title>
        <p>To classify a user on a given day, we use the set of classi ers fC4; C5; : : : ; C26g,
one for the data for each day the user absents. So the classi er C4 is trained
on the users who have been absent for 3 days and evaluated on the 4-th day
of absence; the classi er C5 is trained on the users who have been absent for 4
days, etc. This results in total of 23 classi ers.
2.4</p>
      </sec>
      <sec id="sec-2-4">
        <title>Optimization and greedy solution</title>
        <p>Since the main problem requires the complex elimination procedure, therefore it
seems to be complicated (if even possible) to optimize the meta-metric
analytically. Instead, we use a greedy approach, with models for solving Problem (3)
optimized by the cohort ensemble classi ers parameters in terms of ROC AUC
metric, and these classi ers thresholds (decision boundaries) are optimized by
the meta-metric.
3
3.1</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Data description</title>
        <p>We used data from the Ghost Tales project on Facebook during the period
11.2015 { 03.2016, with the last month as a test subset. Hereby the features
described below are mostly game-independent (for free-to-play games), so this
feature set could be used for other projects as well.</p>
        <p>Here we treat a player on his/her di erent days of absence as di erent players,
which allows us to increase the number of objects in the dataset by more than
10 times, resulting in 2.7M records in total.</p>
        <p>Typically, features could be divided into three main groups:
{ Personal features: year of birth, country, etc.
{ Behavioral features: the total number of active days for a player, the number
of days a player is absent, the number of completed quests, etc.
{ Transactional features (based on users payments).
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Data processing</title>
        <sec id="sec-3-2-1">
          <title>Our data processing includes the following steps:</title>
          <p>1. Feature engineering. We have calculated pairwise ratios between all the
features with the same measurement units (e.g. number of days with payments
divided by total days for a given player) and added some other feature
combinations based on expert knowledge.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>2. Deletion of low variance features.</title>
          <p>3. Selection of the best features based on F -test.</p>
          <p>
            The number of the best features selected may vary for di erent base
classiers. Here we used Extra Random Forest classi er [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] and logistic regression.
We have also tried to use Gradient Boosting algorithms (XGBoost [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] and
AdaBoost), but these classi ers shows similar or even worse results and do
not support e ective parallelization. We do not use algorithms based on
neural nets due to their long training time, which is undesirable for weekly
model retraining.
3.3
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Single classi er optimization</title>
        <p>We have performed 10-fold cross-validation procedure on the training set with
optimization by ROC AUC score. The parameters used during the grid search
are summarised below.</p>
        <p>The list of ensemble trees' parameters (taken according to the real-life
resource constraints):
{ number of trees in ensemble: 50 { 500;
{ tree depth: 5 { 20;
{ minimal sample split: 2 { 15.</p>
        <sec id="sec-3-3-1">
          <title>Logistic regression's parameters:</title>
          <p>{ penalty type: l1 or l2;
{ C (regularization constant): 0.1 { 300.</p>
          <p>The performance summary for selected values of the parameters of Extra
Random Forest classi er is given in Table 1.
The simplest method to continue optimization is to set decision boundaries
(thresholds) for all the classi ers to the same value (which can be found via
cross-validation, for example). However, it is easy to see that we should reduce
the decision boundaries of classi ers for several rst days, because the users,
who returned on the rst days, have more impact on the meta-metric. Since
23-parameter optimization (for every classi er) is a computationally exhaustive
problem, we reduced the number of parameters by considering the threshold as a
function of day. We optimized the thresholds for di erent types of such functions:
{ linear: threshold = A day + B;
{ exponential: threshold = A exp(B day);
{ quadratic: threshold = A day2 + B day + C (where A, B, and C are
constants)
using di erential evolution as a global optimization algorithm. However, the
optimization on various cohorts shows that exponential and quadratic
approximations tends to reduce to linear, see Table 2.</p>
          <p>The optimal parameters found by linear optimization results in 3-7% increase
of the meta-metric, compared to the optimized parameters for the constant
decision boundary, see Fig. 2.
4</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this research, we have proposed a cohort-based ensemble model for the churn
prediction. The nal part of this model includes optimization of the original
meta-metric, which is designed to re ect the real-life experience in usage of
prediction models that rely on a cohort test; other steps are also constructed by
taking real-life resource constraints into account. Various numerical experiments
show the importance of the steps used in the model training pipeline.</p>
      <p>During this research, the algorithm for the weekly model evaluation has been
developed and implemented in the Webgames company. Mostly automated, this
algorithm requires human assistance only at the step of the meta-metric
parameters' choice.</p>
      <p>We assume it can be used in various areas for churn prediction. Thus, the
obtained results form the groundwork for model improvement and generalization
in other areas such as telecommunication and banking industries.</p>
      <p>Acknowledgment. We would like to thank Webgames' Analytics department
for the provided data and feedback. We are also grateful to Mehdi Kaytoue from
INSA-Lyon (France), and Evgeniy Sokolov and Peter Romov from Yandex Data
Factory (Moscow, Russia) for their advice. Last but not least we would like to
thank Dmitry Ignatov (HSE, Moscow) for his supervision.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rowe</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , Zhang,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          :
          <article-title>Credit card churn forecasting by logistic regression and decision tree</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>38</volume>
          (
          <issue>12</issue>
          ) (
          <year>2011</year>
          )
          <volume>15273</volume>
          {
          <fpage>15285</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mutanen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahola</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nousiainen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Customer churn prediction-a case study in retail banking</article-title>
          .
          <source>In: Proc. of ECML/PKDD Workshop on Practical Data Mining</source>
          . (
          <year>2006</year>
          )
          <volume>13</volume>
          {
          <fpage>19</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          , Han,
          <string-name>
            <given-names>S.P.</given-names>
            ,
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.S.:</surname>
          </string-name>
          <article-title>Customer churn analysis: Churn determinants and mediation e ects of partial defection in the Korean mobile telecommunications service industry</article-title>
          .
          <source>Telecommunications policy</source>
          <volume>30</volume>
          (
          <issue>10</issue>
          ) (
          <year>2006</year>
          )
          <volume>552</volume>
          {
          <fpage>568</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bosc</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaytoue-Uberall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Rassi,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Boulicaut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>Mining balanced sequential patterns in RTS games</article-title>
          .
          <source>In: ECAI 2014 - 21st European Conference on Arti cial Intelligence</source>
          ,
          <fpage>18</fpage>
          -22
          <source>August</source>
          <year>2014</year>
          , Prague, Czech Republic. (
          <year>2014</year>
          )
          <volume>975</volume>
          {
          <fpage>976</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Geurts</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ernst</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wehenkel</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Extremely randomized trees</article-title>
          .
          <source>Machine learning 63(1)</source>
          (
          <year>2006</year>
          )
          <volume>3</volume>
          {
          <fpage>42</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
          </string-name>
          , T.:
          <article-title>XGBoost: eXtreme Gradient Boosting</article-title>
          .
          <source>R package version 0</source>
          .4-
          <fpage>2</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>