User Modeling and Churn Prediction in Over-the-top Media Services

Vineeth Rakesh, Interdigital AI Lab, USA, vineeth.mohan@interdigital.com
Ajith Pudiyavitil*, Lowe's, USA, ajithkp12@gmail.com
Jaideep Chandrashekar, Interdigital AI Lab, USA, jaideep.chandrashekar@interdigital.com

* This work was done when the author was at Interdigital AI Lab.

ABSTRACT
We address the problem of customer retention (churn) in applications installed on over-the-top (OTT) streaming devices. In the first part of our work, we analyze various behavioral characteristics of users that drive application usage. By examining a variety of statistical measures, we answer the following questions: (1) how do users allocate time across various applications? (2) how consistently do users engage with their devices? and (3) how likely are dormant users to become active again? In the second part, we leverage these insights to design interpretable churn prediction models that learn the latent characteristics of users by prioritizing the most informative aspects of their behavior. Specifically, we propose the following models: (1) Attention LSTM (ALSTM), where churn prediction uses a single level of attention that weights individual time frames (temporal-level attention), and (2) the Neural Churn Prediction Model (NCPM), a more comprehensive model that uses two levels of attention, one measuring the temporality of each feature and another measuring the influence across features (feature-level attention). Using a series of experiments, we show that our models provide good churn prediction accuracy with interpretable reasoning. We believe that the data analysis, feature engineering, and modeling techniques presented in this work can help organizations better understand the reasons behind user churn on OTT devices.

Reference Format:
Vineeth Rakesh, Ajith Pudiyavitil, and Jaideep Chandrashekar. 2020. User Modeling and Churn Prediction in Over-the-top Media Services. In 3rd Workshop on Online Recommender Systems and User Modeling (ORSUM 2020), in conjunction with the 14th ACM Conference on Recommender Systems, September 25th, 2020, Virtual Event, Brazil.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
In recent years, users have increasingly taken to consuming streaming video services via applications (e.g., Netflix, Hulu, YouTube) on so-called over-the-top (OTT) platforms (e.g., AppleTV, Roku, Amazon FireTV). Given the very large (and still growing) number of streaming services, there is fierce competition to attract new customers while maintaining customer satisfaction. Unfortunately, there is a significant cost to attracting new users; thus, service providers are heavily invested in retaining end users and keeping them engaged with their products. These customer retention efforts focus on providing exclusive and engaging content, personalized recommendations, and intuitive user interfaces. When such efforts fail, operators experience customer churn, wherein a subscriber stops using the service. By examining high-level data collected on one such OTT hardware platform, we propose feature engineering techniques for modeling user behavior and leverage these features to develop application-level churn prediction models. Specifically, given a user u who installs an application a at a given time on their OTT device, our model predicts whether u will be engaged (or not engaged) with a after a particular time window.

Users may decide to abandon a streaming service for any number of reasons, such as a limited time budget to consume content, an increasing affinity for a different application, or a lack of compelling new content. Yet another reason may be that the user experiences more hardware faults (e.g., reboots, poor WiFi, high memory usage) when a particular service is being used; in such cases, the user perceives these faults as being caused by the application and quits out of frustration. The key observation here is that both application-level and device-level behavior can influence user churn. Consequently, it is critical to model these heterogeneous factors, along with temporally correlated features reflecting usage of different applications, to accurately predict churn. To achieve this, we first analyze a dataset that captures high-level events on OTT devices and examine the signals of application churn. These events can be user-initiated (e.g., opening or closing an application, restarting the device, putting the device to sleep) or device-specific (e.g., automatic reboots, WiFi drops, software/firmware updates). With this data, we examine questions such as: (1) how users allocate time across applications on their OTT device, (2) how often and for how long users engage with the device (and with specific applications), and (3) how long users go dormant, and how likely a dormant user is to become active again. In the second part of our work, we leverage these statistical insights to design interpretable models that are effective in predicting churn across a wide range of scenarios.
The naive approach to building a churn prediction model would be to fix an observation time window T, extract a number of features of interest |m| from this window, and deploy a suitable classification algorithm that predicts whether a subscriber will quit a service after a period of time T. While this is entirely viable, it has two important drawbacks. First, the data is inherently noisy and high-dimensional: OTT devices send out periodic device-level and application-level summaries (the start time and duration of each application session) along with events observed on the box, which results in a feature space of size T × |m|. Second, there is significant inherent temporal correlation in the data: if a user spends a significant amount of time inside an application on successive days, that is a strong signal that he/she will engage with the same application on the next day. Flattening the data over the entire window T into a single representation vector loses this information. An alternative, more principled approach is to learn latent attributes of the data using time-series models such as recurrent neural networks (RNNs) [3] and use them as features for churn prediction. A potential issue with this approach is that the compressed latent vector is inefficient at capturing all the information that leads to churn. Furthermore, it is extremely difficult to interpret the results of vanilla RNNs.
In this paper, we propose models that address the drawbacks of these more conventional approaches. First, we introduce Attention LSTM (ALSTM), in which we modify the neural machine translation (NMT) model [2] for churn prediction; ALSTM models local attention. Second, we propose the Neural Churn Prediction Model (NCPM), which incorporates two levels of attention, i.e., local and global. ALSTM uses temporal-level attention (local attention), where different (sub-)observation windows contribute different weights towards predicting churn. For example, in a particular week we might observe a subscriber slowly starting to watch more and more content on a new application while spending less and less time in another in which they were previously engaged (and which they eventually abandon); features collected in that week should therefore receive higher priority than those from other time frames. NCPM, on the other hand, is a more comprehensive model that captures each feature using a separate ALSTM. The individual ALSTMs are then combined through a feature-level attention layer (global attention). Global attention is much better at prioritizing weights across different temporal features; for example, churn could be influenced more by device-level issues such as periodic reboots than by application engagement. Although attention-based RNNs have been used extensively in natural language processing [7, 25], they have rarely been applied to churn prediction. To our knowledge, the only work that appears to address this area is [26]; however, the attention mechanism used in that work is quite different from ours.
We summarize the major contributions of our work as follows:
• Understanding user behavior: Through data engineering and statistical analysis, we provide several insights that explain the behavior of users in our OTT dataset.
• Predicting churn: We propose attention-based RNN models that learn the characteristics of users in a weighted low-dimensional latent space.
• State-of-the-art performance: By conducting extensive experiments on a real-world dataset, we show that NCPM outperforms all other models over different test cases and achieves an accuracy of up to 89% and an AUC of 92%. Additionally, NCPM interprets the reason for churning by emphasizing features such as the inter-arrival time between apps and the consistency of app usage.

We begin by introducing our dataset in Section 2, and in Section 3 we model the behavior of OTT customers as observed in this dataset. The churn prediction models ALSTM and NCPM are proposed in Section 4, followed by the results of our experiments in Section 5. Finally, we review related work in Section 6 and conclude in Section 7.

2 DATASET
Our dataset consists of high-level application session data (start times and durations) and device-level events spanning September 2018 to April 2019. The data was collected from a sample of 31k AndroidTV-based OTT devices deployed in homes and operated by a large provider. Each of these devices comes pre-installed with a set of applications; apart from these, users (u) can download others from a large catalog on the app store. We observed over 3.7k distinct applications in use in the dataset. However, some devices and applications are used very sporadically and account for very little data. To remove these, we pre-processed the data to filter out devices that were active for fewer than 90 days in total and applications that were used on fewer than 15 devices in our population. This filtering resulted in 14,082 unique users (i.e., OTT devices)¹ and 462 unique apps; we denote this dataset by D. Note that the churn prediction model builds features over the lifetime of an application on a device, starting from the time the application was installed to the time the user is deemed to have abandoned it. Unfortunately, pre-installed applications such as Netflix, YouTube, and SlingTV have no install date. One could exclude such apps from the data, but this would remove a significant number of users, because a large proportion of users confine themselves to the pre-installed apps (which also happen to be the popular streaming services). At the same time, keeping all users in a single bin could seriously bias our modeling: for default apps we have no clear signal of when usage actually began, since the user might have been using the app well before the onset of our study. Therefore, besides D, we create a separate dataset De that completely excludes the default apps; for pairs (u, a) ∈ D that lack a start date, we simply take the first log entry of a by u as a proxy for the actual install date. The subset De covers 8,223 unique devices using 397 distinct applications.

¹ We use the words devices and users interchangeably.

Clearly, the earlier a service provider can predict the churn of u, the more effectively it can take action and address the underlying reasons why the customer might be departing. Consequently, we divide D and De into different ranges of activity days A, where A = {t | t ≤ T, T ∈ {5, 10, 15, 20}}. For instance, T = 5 means that we consider a maximum of five days of user activity to predict churn. Table 1 shows the characteristics of D and De across the different activity-day ranges; the next section explains how churners and non-churners (columns four and five in Table 1) are determined. Note that as T increases, the number of users decreases, because far fewer users continuously use the OTT device for, say, 20 days than for just 5 days.

Table 1: Statistics of the churn prediction datasets D and De for different ranges of activity days T. For example, T = 5 implies that, for a given user-app tuple (u, a), u actively used a for up to 5 days before churning.

            Dataset D                          Dataset De
T    #Users  #Apps  #Churns  #NonChurns   #Users  #Apps  #Churns  #NonChurns
5    13082   402    9017     16044        6390    283    8712     9421
10    7954   347    6339      8596        5500    234    4661     7538
15    5440   256    4123      5932        4929    194    3088     6423
20    5009   179    3193      4067        4453    165    2231     5548

3 ANALYZING USER CHARACTERISTICS
App and Device Usage: Figure 1 (a) shows the ten most popular apps in our dataset, based on the number of users that regularly use them; Sling TV, Netflix, YouTube, and Google games occupy the top four spots. Figure 1 (b) plots the distribution of daily time spent inside the applications. We find that users spend 3-4 hours on average with the OTT device, and only a very small fraction of users spend more than 8 hours. In Figure 1 (c), we break this daily usage into four parts of the day, corresponding to morning (5am-12pm), afternoon (12pm-5pm), evening (5pm-10pm), and night (10pm-5am). Unsurprisingly, users tend to spend more time in the evening than in other periods of the day (median of 2.2 hrs), although this is not significantly higher than the afternoon (median of 1.8 hrs) or the morning (median of 1.4 hrs). Another key statistic of interest is the inter-arrival time between application sessions. Note that there may be no explicit indicator of a user quitting an application: very often, users simply stop using an installed application, or deactivate their accounts while keeping the application installed. Churn must therefore be detected implicitly, i.e., from the fact that the user has not started the application for a sufficiently long time. We calculate the inter-arrival time between successive session start times on a device (across all applications) and plot the maximum values, across all devices, in Figure 1 (d). We observe that users, after an absence, return to their OTT applications within a median time of 6 days (100-200 hours); the 75th-percentile value of this distribution is about 10 days. Later in this section, we leverage this to establish an inactivity threshold when we define churn more precisely.
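As an illustration, the following minimal sketch shows how the per-device maximum inter-arrival time and its third quartile (the inactivity threshold used below) could be derived from session logs. The DataFrame schema (device_id, start_time) and the file name are assumptions for illustration; the paper does not publish its data layout.

```python
import pandas as pd

# Hypothetical session log: one row per application session, with columns
# device_id and start_time (datetime) assumed for illustration.
sessions = pd.read_csv("sessions.csv", parse_dates=["start_time"])

# Inter-arrival time between successive session starts on each device,
# computed across all applications on that device.
sessions = sessions.sort_values(["device_id", "start_time"])
sessions["inter_arrival_hrs"] = (
    sessions.groupby("device_id")["start_time"].diff().dt.total_seconds() / 3600
)

# Maximum inter-arrival gap per device (the quantity plotted in Fig. 1 (d)).
max_gap = sessions.groupby("device_id")["inter_arrival_hrs"].max().dropna()

# The inactivity threshold is the 3rd quartile of this distribution;
# in the paper it works out to roughly 10 days (about 240 hours).
t3q_hours = max_gap.quantile(0.75)
print(f"T3q is approximately {t3q_hours / 24:.1f} days")
```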
User Engagement Patterns: Here we try to understand how users spend time on their OTT devices. We wish to explore the following aspects: (a) are there users who consistently use the box for the same number of hours every day? (b) are there dormant users who do not use their device for a while, but then reactivate it? and (c) are there users who engage with their device only intermittently and for brief periods of time? To answer these questions, we carry out the following analysis (a code sketch follows below). First, for each day that our dataset spans, we compute the cumulative time (in terms of a cumulative distribution function, CDF) that the user engaged with the device. Specifically, we compute (i, c_i) for each user, where i = 1, 2, .., |D| indexes the days in our dataset and c_i is the total number of hours spent on the device up to day i. Next, we carry out a non-linear fit on this data for each user, recording the learned slope, intercept, and standard error as derived features. Finally, we cluster the derived features using K-means, with the number of clusters determined by the silhouette score [23]. This analysis yields four main behavior patterns that cover the vast majority of users, depicted in Figure 2. Each plot is based on the original data of cumulative device engagement time (the x-axis is days elapsed, the y-axis is cumulative time spent). The four patterns can be labeled as follows: (1) mid bloomers (a): users who are initially quiet and do not use the OTT box heavily, but suddenly start using it during the middle phase of their observation period; (2) late bloomers (b): users who remain dormant for a long duration with minimal activity, but suddenly start using the device towards the end; (3) potential churners (c): the users of most interest to us; as the plot shows, these users start using the device heavily at first, but then stop using the box for various reasons (note that since the y-axis is a CDF, a flat line indicates minimal or no activity, also reflected in a very low standard deviation); and (4) consistent users (d): users who regularly use their OTT boxes to watch different shows.
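A minimal sketch of this clustering pipeline under stated assumptions: the degree-2 polynomial fit via np.polyfit is an assumed stand-in, since the paper says only "non-linear fit" without specifying the functional form, and the range of k values is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def engagement_features(cumulative_hours):
    """Fit a curve to one user's cumulative-usage series (hours up to day i)
    and return the fitted coefficients plus the residual standard deviation,
    standing in for the slope/intercept/standard-error features described
    above. The quadratic form is an assumption."""
    y = np.asarray(cumulative_hours, dtype=float)
    days = np.arange(1, len(y) + 1)
    coeffs = np.polyfit(days, y, deg=2)
    resid_std = (y - np.polyval(coeffs, days)).std()
    return np.append(coeffs, resid_std)

def cluster_users(per_user_series, k_range=range(2, 9)):
    """Cluster users by their fitted engagement-curve features, choosing
    the number of clusters by silhouette score."""
    X = np.vstack([engagement_features(s) for s in per_user_series])
    best = max(
        (KMeans(n_clusters=k, n_init=10).fit(X) for k in k_range),
        key=lambda km: silhouette_score(X, km.labels_),
    )
    return best.labels_
```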
Understanding Churning Behavior: Before introducing our prediction models, we briefly explain how we label a user (or device) as having churned, i.e., left the service. This is fundamentally a difficult task because there is no explicit signal for this behavior. Further complicating things, (a) some users do not use the app for a few days but return after a brief period of inactivity, and (b) some users download the app once (or spend a brief amount of time in it) and never use it again. Figure 3 depicts, at a high level, all the information for a user (u) and application (a). The dotted lines at either end mark the period over which data was collected, and each of the green vertical lines in the middle indicates the start of an application session (a_1 is the first session, a_n the last). Here, we see that the application was downloaded after the start of data collection and used several times, with the last instance at t_3. Somewhat infrequently, we see the device itself disappear from the dataset; we consider this a signal that the user has disconnected the device and is no longer using it, and we record this event as having occurred at t_4. With this depiction, we can now define churn in very specific terms by addressing the two challenges discussed above. First, we require that the application not be used for a period of time after the last use; following the example in Figure 3, we impose the condition ∆_2 ≥ T_3q, where T_3q, an inactivity threshold, is the third quartile of the distribution in Figure 1 (d) and turns out to be 10 days. Second, we require a minimum number of sessions to be recorded for an application and user; specifically, u should have engaged with a at least K times, i.e., n ≥ K, and we set K = 3.
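A sketch of the resulting labeling rule. The session-list representation and day units are illustrative; the two conditions (an inactivity gap of at least T_3q = 10 days after the last session, and at least K = 3 recorded sessions) follow the definition above.

```python
T3Q_DAYS = 10       # inactivity threshold: 3rd quartile of max inter-arrival gaps
K_MIN_SESSIONS = 3  # minimum sessions required to label a (user, app) pair

def churn_label(session_days, observation_end_day):
    """Return True (churn), False (non-churn), or None (excluded).

    session_days: sorted day indices on which the app was started;
    observation_end_day: last day covered by data collection.
    """
    if len(session_days) < K_MIN_SESSIONS:
        return None  # fewer than K sessions: not enough evidence to label
    # Condition 1: no usage for at least T3q days after the last session.
    return observation_end_day - session_days[-1] >= T3Q_DAYS
```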
Figure 4 illustrates the characteristics of devices that are exclusively labeled as churned. We see that the median inter-arrival times for the top 5 apps are around 30 days (Figure 4 (b)), significantly higher than the generic inter-arrival characteristics shown in Figure 1 (d). In Figure 4 (c), we notice that users who download more apps tend to have a higher churn rate; we obtained a Pearson correlation coefficient of 0.67. It is also interesting to observe that as churn increases, users tend to switch between apps more frequently, where app switching is indicated by the session feature on the y-axis (Figure 4 (d)).

Figure 1: Generic characteristics of users in D: (a) the top 10 most frequently used apps; (b) the majority of users spend 2-6 hrs per day; (c) users tend to spend more time in the evening; and (d) the majority of users return to their OTT device within a maximum of 8-9 days (≈ 200 hrs).

Figure 2: The four types of users captured by our clustering framework. Clockwise from top left: (a) mid bloomers, (b) late bloomers, (c) potential churners, and (d) consistent users. The error bars show the standard deviation of usage.

Figure 3: High-level summary of user u interacting with application a.

Figure 4: Characteristics of churning OTT devices: (a) the most frequently churned applications in the dataset De; (b) the median inter-arrival time is about 25-30 days for the top-5 churned apps; (c) users tend to churn more when they download more apps; and (d) users who churn on more apps tend to switch between apps more frequently.

4 PREDICTING CHURN
Given a user u and an app a, our objective is to predict whether u will continue or stop using a. We represent each user-app entity as a tuple (X, M_x, Y), where X = {x_1, ..., x_t} is a stream of events (or logs) spanning a time t ∈ T, each event x comprises M features, and Y = {y_1, ..., y_t} are the binary labels that indicate churn (or non-churn) at t. When designing our churn prediction model, we had two main objectives. First, since our data is highly temporal, it is important to learn the latent characteristics of churners (and non-churners) in a way that embeds the temporality of events. Second, we should not only predict churn with good accuracy but also produce highly interpretable results; in other words, we should be able to reason about why a user is churning. To achieve this, we propose the following models: (1) attention LSTM (ALSTM), a simple modification of the neural machine translation (NMT) model [2], and (2) the neural churn prediction model (NCPM), a more comprehensive model that incorporates temporal-level attention (local attention) and feature-level attention (global attention). Both models are based on recurrent neural networks (RNNs), which have been shown to be effective at modeling time-series data [5, 8]. RNNs take a series of temporally dependent inputs and learn their latent representation (or hidden state vector) using the following expression:

h_t = f(h_{t−1}, x_t)    (1)

where h_t is the hidden state at time t and f is some non-linear function. For our application, we model f using a long short-term memory network (LSTM) [13], whose gates and states are defined as follows:

i_t = σ(W_i · [h_{t−1}; x_t] + b_i)
f_t = σ(W_f · [h_{t−1}; x_t] + b_f)
c_t = f_t × c_{t−1} + i_t × tanh(W_c · [h_{t−1}; x_t] + b_c)    (2)
o_t = σ(W_o · [h_{t−1}; x_t] + b_o)
h_t = o_t × tanh(c_t)

where t is the time step (i.e., days), h_t is the hidden state at t, c_t is the cell state at t, x_t is the input (the hidden state of the previous layer) at time t, and i_t, f_t, o_t are the input, forget, and output gates, respectively.

Figure 5: The end-to-end architecture of the proposed models: (a) ALSTM, which applies attention to a single LSTM network to model churn, and (b) NCPM, which uses separate LSTM networks to model individual features and predicts churn using weighted attentions.

Attention LSTM (ALSTM): One can, of course, predict churn by simply feeding the input X to a vanilla LSTM, taking the latent vectors h from the final layer, and using them as features for prediction. A potential issue with this approach is that the compressed (low-dimensional) latent vector h is inefficient at capturing all the information that contributes to churn. As explained in Section 1, in a particular week we might observe a subscriber slowly navigating towards a new application and spending less and less time in one they were previously engaged with (before eventually abandoning it). It is therefore important to give such time windows higher priority than other weeks, and vanilla LSTM networks fail to prioritize these key events. Inspired by recent developments in neural machine translation (NMT) [2], we incorporate attention into the LSTM to overcome this issue. Since our application is very different from natural language processing, we introduce two modifications over NMT: first, we replace the decoder with a single-layer neural network (NN) with sigmoid activation for churn prediction, and second, we change the attention mechanism to suit our problem. The proposed ALSTM model is shown in Figure 5 (a). Here, the attention block A outputs a vector of weights α that captures the importance of the latent vector h at each time frame t. The weighted latent vector p is defined as follows:

p = Σ_{t=1}^{T} α_t h_t    (3)

where the weight α_j for a time instance j is defined by

α_j = exp(s_j) / Σ_{t=1}^{T} exp(s_t),  with  s_j = Σ_i Σ_{k=1}^{K} h_t^k · W_{kt}^j    (4)
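Since our experiments (Section 5) use Keras with TensorFlow, the following is a minimal sketch of the ALSTM idea under stated assumptions: a dense scoring layer stands in for the exact parameterization of s_j in Eq. (4), and the sizes (T = 20, m = 8, 64 hidden units) are illustrative rather than the paper's settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_alstm(T, m, hidden=64):
    """ALSTM sketch: an LSTM encoder with temporal (local) attention and a
    single sigmoid unit in place of the NMT decoder."""
    x = layers.Input(shape=(T, m))                       # T days, m features/day
    h = layers.LSTM(hidden, return_sequences=True)(x)    # hidden states h_1..h_T

    s = layers.Dense(1)(h)                               # per-day scores s_t
    alpha = layers.Softmax(axis=1)(s)                    # Eq. (4): softmax over time
    p = layers.Lambda(                                   # Eq. (3): p = sum_t alpha_t h_t
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([alpha, h])

    y = layers.Dense(1, activation="sigmoid")(p)         # churn probability
    return Model(x, y)

model = build_alstm(T=20, m=8)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```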
Neural churn prediction model (NCPM): One drawback of ALSTM is that it cannot prioritize across features. For example, churn could be influenced more by the consistency of users (see Figure 2), while the number of downloads might matter little. To overcome this, we incorporate both temporal-level and feature-level attention. As depicted in Figure 5 (b), instead of treating the features as a single vector, we decouple the features and model each one with an individual ALSTM. As in ALSTM, the attention block producing p_m for a feature m captures the influence (or weight) of the latent features from different slices of time. The feature-level attention, in turn, is captured by the block B, which is defined by the following expressions:

g = Σ_{m=1}^{M} β_m p_m    (5)

where β_m denotes the individual feature attention weights, defined as

β_m = exp(c^m) / Σ_{m=1}^{M} exp(c^m)    (6)

c_j^m = Σ_{i=1}^{Z} z_i U_{ij}    (7)

where z = p_1 ⊕ {p_i}_{i=2}^{m} is the concatenation (indicated by ⊕) of the feature-level latent vectors. Finally, to predict churn, a linear projection with a sigmoid function is connected to the output of the last layer:

ŷ = σ(W_g · g + b_g)    (8)

The loss for both ALSTM and NCPM is the binary cross-entropy:

L = Σ_i [ −y_i log(ŷ_i) − (1 − y_i) log(1 − ŷ_i) ]    (9)
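A corresponding sketch of NCPM under the same assumptions: one ALSTM-style encoder per feature produces the temporally attended vectors p_m, and a second softmax layer implements the feature-level weights β_m of Eqs. (5)-(7), with a dense scoring layer standing in for the z/U parameterization of Eq. (7). Dimensions are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def feature_encoder(x_m, hidden=32):
    """One ALSTM block: temporal attention over a single feature's LSTM
    states, returning the attended vector p_m (cf. Eq. (3))."""
    h = layers.LSTM(hidden, return_sequences=True)(x_m)
    alpha = layers.Softmax(axis=1)(layers.Dense(1)(h))
    return layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([alpha, h])

def build_ncpm(T, M, hidden=32):
    """NCPM sketch: M per-feature ALSTMs combined by feature-level attention."""
    x = layers.Input(shape=(T, M))
    # Decouple the features: one univariate series per feature m.
    ps = [feature_encoder(layers.Lambda(lambda t, i=i: t[:, :, i:i + 1])(x), hidden)
          for i in range(M)]
    stacked = layers.Lambda(lambda t: tf.stack(t, axis=1))(ps)   # (batch, M, hidden)
    beta = layers.Softmax(axis=1)(layers.Dense(1)(stacked))      # Eq. (6): feature weights
    g = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([beta, stacked])
    y = layers.Dense(1, activation="sigmoid")(g)                 # Eq. (8)
    return Model(x, y)

# Trained with the binary cross-entropy of Eq. (9), e.g.:
# build_ncpm(T=20, M=8).compile(optimizer="adam", loss="binary_crossentropy")
```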
5 EXPERIMENTS
As Table 1 shows, our dataset is biased towards negative samples (i.e., non-churns). Therefore, to create a balanced dataset, for every positive data point (i.e., a churn) for an app a, we randomly sample a corresponding negative data point. We test our models by varying the number of days in the training sample (as explained in Section 2); this lets us see how quickly our models can predict churn. For all experiments, we use 10-fold cross validation, with eight folds used for training, one for validation, and one for testing. The deep learning models are implemented in Keras with TensorFlow as the back-end.

5.1 Baselines
We compare the performance of the proposed ALSTM and NCPM with three baseline methods. Unlike the proposed models, these baselines do not capture the temporality of the data; their inputs are therefore vectors flattened across the time frames.
Logistic Regression: the classic model for binary classification. Albeit simplistic, it helps us understand whether a linear decision boundary is sufficient to separate churners from non-churners. We use the L2 norm as the regularizer and Stochastic Average Gradient (SAG) as the solver.
Multi-layer Perceptron (MLP): a simple two-layer neural network with a dropout layer to avoid overfitting. A linear projection with a sigmoid function is connected to the output of the last layer to produce the churn prediction. Like logistic regression, the MLP does not capture temporal dependencies in the data. The number of neurons is set to 80 for each intermediate layer.
Random Forest (RF): despite the rapid advancements in deep learning, ensemble techniques such as RF [4] remain highly competitive on data with several modalities. In our experiments, the number of decision trees is set to 50 and the maximum depth to 10.
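For concreteness, the baseline configurations described above could be instantiated as follows (a sketch using scikit-learn and Keras; the inputs are the per-user features flattened across the T time frames, and the ReLU activations, max_iter value, and 0.5 dropout rate are assumptions not stated in the paper):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras import layers, models

# Logistic regression: L2 regularization with the SAG solver.
logistic = LogisticRegression(penalty="l2", solver="sag", max_iter=1000)

# Random forest: 50 trees with a maximum depth of 10.
rf = RandomForestClassifier(n_estimators=50, max_depth=10)

def build_mlp(input_dim):
    """Two hidden layers of 80 neurons plus dropout, with a sigmoid output;
    the dropout rate is an assumed value."""
    return models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(80, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(80, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
```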
5.2 Results
Classification accuracy: Tables 2 and 3 show that NCPM outperforms all other models on both De and D, achieving an accuracy of up to 92%. As we increase the number of days, accuracy increases for all models except logistic regression. Here, CA-5 denotes the classification accuracy with just 5 days of data, while CA-20 denotes 20 days of data. We can also see that ALSTM is not as good as NCPM, which supports two conclusions: (1) it is important to learn the latent attributes of each individual feature separately, and (2) incorporating both global and local attention is necessary. That said, ALSTM clearly outperforms the MLP, which underscores the necessity of learning the temporal actions of OTT users. The worst-performing model is logistic regression, which is only slightly better than random selection; this illustrates the difficulty of our churn prediction task. The performance of RF is very close to that of ALSTM, indicating that ensemble models remain a strong candidate for our problem.

In general, model performance on the non-continuous data is much better than on its continuous counterpart, which the following example explains. Say u uses an app on six days before churning, and we have the data {m_1, m_4, m_6, m_7, m_8, m_10} for u, where m is some feature and the suffix indicates the day. Our objective is to predict the outcome on the sixth active day using the first five. Since the user does not use the OTT box on days two, three, and five, the continuous data fed to our models (both NCPM and ALSTM) is essentially the sparse vector {m_1, 0, 0, m_4, 0}, which has several missing values. By contrast, the non-continuous dataset contains the actual usage values for five active days. The model therefore trains on more observed data points, which leads to better prediction accuracy. Another interesting observation is that the performance of all models (except logistic regression) is noticeably better on the all-apps dataset. One key reason is the popularity of the default apps: apps such as Sling TV, Netflix, and YouTube are significantly more popular than the non-default apps, so the models can learn their churn patterns more effectively.

AUC and ROC characteristics: Figures 6 and 7 compare the ROC characteristics of the proposed models with the baselines. The corresponding AUC values are listed in Table 4; due to space constraints, only the non-default case is furnished. As with the accuracy scores, NCPM dominates the other models in most scenarios. We also notice that RF tends to perform better than NCPM and ALSTM when the temporal length of the data is short (i.e., just five days); however, as we incorporate more days of training data, our models clearly improve. The outcome for dataset D is quite different from De: on D we achieve an AUC of almost 89% with just five days of data, and ALSTM performs very similarly to NCPM.

Interpreting the churn prediction: One of the key strengths of our models is interpretability. As explained in Section 1, ALSTM provides a single level of interpretability, indicating which days matter when predicting churn. NCPM has two levels: besides identifying the important days, it also tells us which features are important. We present the interpretability scores as heatmaps in Figure 8; due to lack of space, we only furnish the results for the non-continuous dataset. Heatmaps (a)-(d) show that the influence of features is not uniform across apps. For instance, when few days are available for prediction, churn is influenced mainly by two attributes, namely the number of downloads and the cluster types (Figures 8 (a) and (c)). As we incorporate more data for training (i.e., more days), the attention becomes focused on a few key features: for non-default apps more attention falls on the inter-arrival time, while for all apps the influence shifts towards the number of reboots. It is possible that these apps experience a higher number of app crashes, leading users to reboot the device. Among the non-default apps, Hulu, Pluto TV, and Kodi are heavily influenced by the cluster-id feature that we engineered in Section 3. When it comes to temporal attention (Figures 8 (e)-(h)), for dataset D the influence is concentrated on a few select days, i.e., day 4 in the non-continuous case and day 2 in the continuous case; in contrast, for De the influence is spread across almost all days.

In Section 3, we explained that the engagement pattern of users could have a strong impact on churn. To show this effect, for each user we take the final attention score from the individual RNNs of NCPM and plot the outcome in Figure 9 (a). Here, consistent users are the strongest indicators of non-churn, while potential churners are the strongest indicators of churn. Interestingly, mid bloomers receive higher attention than late bloomers when predicting non-churners, while the opposite holds for churners. Figure 9 (b) takes a deeper look at this outcome by emphasizing the importance of temporal progression for the user types. Unsurprisingly, during the initial phase (elapsed duration of 10-20%), almost all types receive low attention weights, because early on we do not yet have enough data about the user type. As time progresses, around 20-50% of the elapsed duration, potential churners have the strongest impact on the outcome, followed by consistent users and late bloomers. Around 50-80%, the impact of potential churners drops drastically, while that of mid and late bloomers increases. At the final stage (80-100%), almost all user types again have low importance (attention), because in the last phase more data is available in the form of other features, such as the number of downloads and the inter-arrival time between apps; the model can consequently rely on these better indicators in the later stages.
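As a sketch of how such interpretability scores could be read out of a trained model (assuming the ALSTM sketch given in Section 4, where a Softmax layer carries the α weights; the layer index is tied to that sketch and is an assumption):

```python
import numpy as np
from tensorflow.keras import Model

def attention_weights(model, X, softmax_layer_index=-3):
    """Extract the softmax attention activations from a trained model.
    In practice one would name the attention layers and look them up
    rather than rely on a positional index."""
    probe = Model(model.input, model.layers[softmax_layer_index].output)
    return np.squeeze(probe.predict(X), axis=-1)   # (num_users, T) weights

# Averaging over users gives per-day influence rows like Fig. 8 (e)-(h):
# day_influence = attention_weights(alstm, X_test).mean(axis=0)
```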
Table 2: Classification accuracy (CA) for non-default apps with (a) non-continuous data and (b) continuous data, across 5-20 days.

(a) Non-default, non-continuous           (b) Non-default, continuous
Model     CA-5   CA-10  CA-15  CA-20      Model     CA-5   CA-10  CA-15  CA-20
Logistic  0.56   0.55   0.54   0.54       Logistic  0.56   0.55   0.54   0.54
RF        0.65   0.73   0.79   0.80       RF        0.66   0.70   0.73   0.75
MLP       0.60   0.66   0.72   0.78       MLP       0.58   0.63   0.70   0.74
ALSTM     0.65   0.77   0.83   0.86       ALSTM     0.60   0.67   0.74   0.79
NCPM      0.67   0.79   0.84   0.88       NCPM      0.62   0.70   0.76   0.81

Table 3: Classification accuracy (CA) for all apps with (a) non-continuous data and (b) continuous data, across 5-20 days.

(a) All apps, non-continuous              (b) All apps, continuous
Model     CA-5   CA-10  CA-15  CA-20      Model     CA-5   CA-10  CA-15  CA-20
Logistic  0.56   0.55   0.54   0.54       Logistic  0.56   0.55   0.54   0.54
RF        0.73   0.78   0.85   0.89       RF        0.72   0.78   0.78   0.83
MLP       0.70   0.72   0.73   0.74       MLP       0.62   0.70   0.71   0.74
ALSTM     0.78   0.83   0.87   0.91       ALSTM     0.74   0.79   0.83   0.86
NCPM      0.78   0.84   0.89   0.92       NCPM      0.76   0.82   0.84   0.89

Table 4: Area under the ROC curve (AUC) for non-default apps with (a) non-continuous data and (b) continuous data, across 5-20 days.

(a) Non-default, non-continuous           (b) Non-default, continuous
Model     AUC-5  AUC-10 AUC-15 AUC-20     Model     AUC-5  AUC-10 AUC-15 AUC-20
Logistic  0.56   0.55   0.55   0.55       Logistic  0.55   0.53   0.53   0.54
RF        0.71   0.81   0.87   0.89       RF        0.71   0.77   0.83   0.86
MLP       0.65   0.72   0.79   0.84       MLP       0.62   0.69   0.77   0.83
ALSTM     0.66   0.76   0.85   0.89       ALSTM     0.60   0.66   0.71   0.78
NCPM      0.72   0.85   0.90   0.91       NCPM      0.66   0.76   0.84   0.88

Figure 6: Receiver operating characteristic (ROC) curves of the proposed ALSTM and NCPM models for non-default apps at 5, 10, 15, and 20 days. Curves (a)-(d) show the non-continuous dataset and (e)-(h) the continuous dataset.

Figure 7: Receiver operating characteristic (ROC) curves of the proposed ALSTM and NCPM models for all apps at 5, 10, 15, and 20 days. Curves (a)-(d) show the non-continuous dataset and (e)-(h) the continuous dataset.

Figure 8: The feature-level (a-d) and temporal-level (e-h) influences on churn prediction for the non-continuous dataset, at 10 and 20 days for De and D. The color gradient denotes the probability scores, where red indicates the highest weight and green the lowest influence.

Figure 9: The attention weights of mid bloomers (MB), late bloomers (LB), potential churners (PC), and consistent users (CU): (a) the overall attention scores during the final phase of prediction and (b) the attention scores at different stages of temporal progression.
6 RELATED WORK
The problem tackled in this paper relates to the following topics: (1) churn prediction, (2) user behavior modeling, and (3) interpretable neural networks. We now discuss existing research on each topic.
Churn Prediction: User retention (or churn) has been studied extensively in social computing and human-computer interaction (HCI) [9, 14, 27]. However, developing predictive models for churn is still in its infancy. Au et al. [1] adopt a rule-based learning technique for early churn prediction. In [29], the authors tackle churn prediction in mobile apps and find that application performance factors such as energy consumption and latency have a significant impact on retention. [16] use a social-influence-based approach for churn prediction. Recently, [26] developed an interpretable framework that constrains the objective of an RNN with the outcome of K-means clustering to predict the retention of users in Snapchat.
Modeling user behavior: There is a large body of research on behavior modeling [10, 15, 21]. For example, [6] predict user intents by leveraging activity logs in Pinterest. [12] predict the likelihood of a successful search in web search queries, showing that user behavior is more predictive of goal success than document relevance. [22] model the behavior of users in the Kickstarter crowdfunding domain using a heterogeneous combination of social communities, project popularity, and the impact of reward categories. Studies such as [11, 24] and [21] model user behavior from sequential actions such as click streams and social network activities: [24] use a combination of Mahalanobis distance (for detecting outliers) and Markov chains to model sessions in click streams, while [21] use a temporal-LDA-based approach for tour recommendation in Foursquare.
Interpretable Sequence Modeling: RNNs have become the state-of-the-art technique for sequence modeling [5, 13]. Despite a plethora of research in the NLP domain [2, 19], extending interpretable RNNs to other real-world applications is still an emerging field. In recent work, [18] predict the engagement of users in the Snapchat app by capturing in-app action transition patterns as a temporally evolving action graph. [17] develop an interpretable LSTM that learns multi-level graph structures in a progressive and stochastic manner. [20] propose a dual-stage attention model for medical diagnostics such as heart failure prediction. Zhou et al. [28] propose an attention-based RNN that predicts users' purchase probability for targeted ads; although its NN architecture is similar to ours, the problem is quite different from churn prediction, and their modeling of local and global attention also differs from ours. To the best of our knowledge, the only research that closely resembles our work is the churn prediction model proposed by Yang et al. [26]; however, the attention mechanism used in their work is quite different from ours.
7 CONCLUSION
In this paper, we proposed interpretable recurrent-neural-network-based models for predicting churn on over-the-top (OTT) media devices. In the first part of the paper, we analyzed the behavioral characteristics of users and found that they can be categorized into four main types: mid bloomers, late bloomers, potential churners, and consistent users. In the second part, we introduced two models for churn prediction, namely Attention LSTM (ALSTM) and the Neural Churn Prediction Model (NCPM). In ALSTM, churn is predicted by weighting individual time frames (temporal-level attention); NCPM uses two levels of attention, namely feature-level and temporal-level. We showed that NCPM outperforms all other models over a wide range of test cases and achieves an accuracy of up to 89% and an AUC of 92%.

REFERENCES
[1] Wai-Ho Au, Keith C. C. Chan, and Xin Yao. 2003. A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation 7, 6 (2003), 532–545.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157–166.
[4] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[5] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. 2018. Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8, 1 (2018), 6085.
[6] Justin Cheng, Caroline Lo, and Jure Leskovec. 2017. Predicting intent using activity logs: How goal specificity and temporal range affect user behavior. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 593–601.
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[8] Edward Choi, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2016. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association 24, 2 (2016), 361–370.
[9] Giovanni Luca Ciampaglia and Dario Taraborelli. 2015. MoodBar: Increasing new user retention in Wikipedia through lightweight socialization. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 734–742.
[10] Gideon Dror, Dan Pelleg, Oleg Rokhlenko, and Idan Szpektor. 2012. Churn prediction in new users of Yahoo! Answers. In Proceedings of the 21st International Conference on World Wide Web. ACM, 829–834.
[11] Şule Gündüz and M. Tamer Özsu. 2003. A web page prediction model based on click-stream tree representation of user behavior. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 535–540.
[12] Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. 2010. Beyond DCG: User behavior as a predictor of a successful search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 221–230.
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Selim Ickin, Katarzyna Wac, Markus Fiedler, Lucjan Janowski, Jin-Hyuk Hong, and Anind K. Dey. 2012. Factors influencing quality of experience of commonly used mobile applications. IEEE Communications Magazine 50, 4 (2012), 48–56.
[15] Marcel Karnstedt, Matthew Rowe, Jeffrey Chan, Harith Alani, and Conor Hayes. 2011. The effect of user features on churn in social networks. In Proceedings of the 3rd International Web Science Conference. ACM, 23.
[16] Jaya Kawale, Aditya Pal, and Jaideep Srivastava. 2009. Churn prediction in MMORPGs: A social influence based approach. In 2009 International Conference on Computational Science and Engineering, Vol. 4. IEEE, 423–428.
[17] Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, and Eric P. Xing. 2017. Interpretable structure-evolving LSTM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1010–1019.
[18] Yozen Liu, Xiaolin Shi, Lucas Pierce, and Xiang Ren. 2019. Characterizing and forecasting user engagement with in-app action graph: A case study of Snapchat. arXiv preprint arXiv:1906.00355 (2019).
[19] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
[20] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 (2017).
[21] Vineeth Rakesh, Niranjan Jadhav, Alexander Kotov, and Chandan K. Reddy. 2017. Probabilistic social sequential model for tour recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 631–640.
[22] Vineeth Rakesh, Wang-Chien Lee, and Chandan K. Reddy. 2016. Probabilistic group recommendation model for crowdfunding domains. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 257–266.
[23] Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.
[24] Narayanan Sadagopan and Jie Li. 2008. Characterizing typical and atypical user sessions in clickstreams. In Proceedings of the 17th International Conference on World Wide Web. ACM, 885–894.
[25] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[26] Carl Yang, Xiaolin Shi, Luo Jie, and Jiawei Han. 2018. I know you'll be back: Interpretable new user clustering and churn prediction on a mobile social application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 914–922.
[27] Igor Zakhlebin, Em Horvát, et al. 2019. Investor retention in equity crowdfunding. In Proceedings of the 10th ACM Conference on Web Science. ACM, 343–351.
[28] Yichao Zhou, Shaunak Mishra, Jelena Gligorijevic, Tarun Bhatia, and Narayan Bhamidipati. 2019. Understanding consumer journey using attention based recurrent neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 3102–3111.
[29] Agustin Zuniga, Huber Flores, Eemil Lagerspetz, Petteri Nurmi, Sasu Tarkoma, Pan Hui, and Jukka Manner. 2019. Tortoise or hare? Quantifying the effects of performance on mobile app retention. In The World Wide Web Conference. ACM, 2517–2528.