=Paper=
{{Paper
|id=Vol-2715/paper8
|storemode=property
|title=User Modeling and Churn Prediction in Over-the-top Media Services
|pdfUrl=https://ceur-ws.org/Vol-2715/paper8.pdf
|volume=Vol-2715
|authors=Vineeth Rakesh,Ajith Pudiyavitil,Jaideep Chandrashekar
|dblpUrl=https://dblp.org/rec/conf/recsys/RakeshPC20
}}
==User Modeling and Churn Prediction in Over-the-top Media Services==
Vineeth Rakesh, Interdigital AI Lab, USA (vineeth.mohan@interdigital.com)
Ajith Pudiyavitil*, Lowe's, USA (ajithkp12@gmail.com)
Jaideep Chandrashekar, Interdigital AI Lab, USA (jaideep.chandrashekar@interdigital.com)
ABSTRACT

We address the problem of customer retention (churn) in applications installed on over-the-top (OTT) streaming devices. In the first part of our work, we analyze various behavioral characteristics of users that drive application usage. By examining a variety of statistical measures, we answer the following questions: (1) how do users allocate time across various applications? (2) how consistently do users engage with their devices? and (3) how likely are dormant users to become active again? In the second part, we leverage these insights to design interpretable churn prediction models that learn the latent characteristics of users by prioritizing the specifications of the users. Specifically, we propose the following models: (1) Attention LSTM (ALSTM), where churn prediction is done using a single level of attention by weighting individual time frames (temporal-level attention), and (2) Neural Churn Prediction Model (NCPM), a more comprehensive model that uses two levels of attention, one measuring the temporality of each feature and another measuring the influence across features (feature-level attention). Using a series of experiments, we show that our models provide good churn prediction accuracy with interpretable reasoning. We believe that the data analysis, feature engineering and modeling techniques presented in this work can help organizations better understand the reasons behind user churn on OTT devices.

Reference Format:
Vineeth Rakesh, Ajith Pudiyavitil, and Jaideep Chandrashekar. 2020. User Modeling and Churn Prediction in Over-the-top Media Services. In 3rd Workshop on Online Recommender Systems and User Modeling (ORSUM 2020), in conjunction with the 14th ACM Conference on Recommender Systems, September 25th, 2020, Virtual Event, Brazil.

* This work was done when the author was at Interdigital AI Lab.

ORSUM@ACM RecSys 2020, September 25th, 2020, Virtual Event, Brazil. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

In recent years, users have increasingly taken to consuming streaming video services via applications (e.g., Netflix, Hulu, YouTube) on so-called over-the-top (OTT) platforms (e.g., AppleTV, Roku, Amazon FireTV). Given the very large (and still growing) number of streaming services, there is fierce competition to attract new customers while maintaining customer satisfaction. Unfortunately, there is significant cost to attracting new users; thus, service providers are heavily invested in retaining end-users and keeping them engaged with their products. These customer retention efforts focus on providing exclusive and engaging content, personalized recommendations and intuitive user interfaces. When such efforts fail, the operators experience customer churn, wherein a subscriber stops using the service. By examining high-level data collected on one such OTT hardware platform, we propose feature engineering techniques for modeling user behavior and leverage these features to develop application-level churn prediction models. Specifically, given a user u who installs an application a at a given time on their OTT device, our model predicts whether u will be engaged (or not engaged) with a after a particular time window.

Users may decide to abandon a streaming service for any number of reasons, such as a limited time budget to consume content, an increasing affinity for a different application, or a lack of compelling new content. Yet another reason may be that the user experiences more hardware faults (e.g., reboots, poor wifi, high memory usage) when a particular service is being used. In such cases, the user perceives these faults as being caused by the application and quits out of frustration. The key observation here is that both application-level and device-level behavior can influence user churn. Consequently, it is critical to model these heterogeneous factors along with temporally correlated features reflecting usage of different applications to accurately predict churn. To achieve this, we first analyze a dataset that captures high-level events on OTT devices and examine the signals of application churn. These events could be user-initiated (e.g., opening or closing an application, restarting the device, putting the device to sleep) or device-specific (e.g., automatic reboots, wifi drops, software/firmware updates). With this data, we examine questions such as: (1) how users allocate time across applications on their OTT device, (2) how often and for how long users engage with the device (and specific applications), and (3) how long users go dormant, and how likely a dormant user is to become active again. In the second part of our work, we leverage these statistical insights to design interpretable models that are effective in predicting churn across a wide range of scenarios.

The naive approach to building a model that predicts churn would be to fix an observation time window T, extract a number of features of interest |m| from this window, and then deploy a suitable classification algorithm that predicts whether a subscriber will quit a service after a period of time T. While this is entirely viable, there are two important drawbacks. First, the data is inherently noisy and high-dimensional; OTT devices send out periodic device-level and application-level summaries (the start time and duration of an application session), and events observed on the box. This results in a feature space of size T × |m|. Second, there is significant inherent temporal correlation in the data; if a user spends a significant amount of time inside an application on successive days, there is a strong signal that they will engage with the same application on the next day. Flattening data over the entire window T into a single representation vector will lead to this information being lost. An alternative, more principled approach is to learn latent attributes
of the data using time-series models such as recurrent neural networks (RNNs) [3] and use them as features for churn prediction. A potential issue with this approach is that the compressed latent vector is inefficient at capturing all the necessary information that leads to churn. Furthermore, it is extremely difficult to interpret the results of vanilla RNNs.

In this paper, we propose a model that addresses the drawbacks of the more conventional approaches. First, we introduce Attention LSTM (ALSTM), where we modify the neural machine translation (NMT) model [2] for churn prediction; ALSTM models the local attention. Second, we propose the Neural Churn Prediction Model (NCPM), which incorporates two levels of attention, i.e., local and global. ALSTM uses temporal-level attention (local attention), where different (sub) observation windows contribute different weights towards predicting churn. For example, in a particular week we might observe the subscriber slowly starting to watch more and more content on a new application, and spending less and less time in another in which they were previously engaged (and eventually abandon). Thus, features collected in this week might require higher priority when compared to other time frames. NCPM, on the other hand, is a more comprehensive model that captures each feature using a separate ALSTM. The individual ALSTMs are then combined with a feature-level attention layer (global attention). Global attention is much better at prioritizing weights across different temporal features. For example, churning could be more influenced by device-level issues such as periodic reboots rather than application engagement. Although attention-based RNNs have been extensively used in the field of natural language processing [7, 25], they have very rarely been applied to the problem of churn prediction. To our knowledge, the only work that appears to address this area is [26]; however, the attention mechanism used in their work is quite different from ours. We summarize the major contributions of our work as follows:

• Understanding user behavior: Through data engineering and statistical analysis, we provide several insights that explain the behavior of users in our OTT dataset.
• Predicting churn: We propose attention-based RNN models that learn the characteristics of users in a weighted low-dimensional latent space.
• State-of-the-art performance: By conducting extensive experiments on a real-world dataset, we show that NCPM outperforms all other models over different test cases, achieving an accuracy of up to 89% and an AUC of 92%. Additionally, NCPM interprets the reason for churning by emphasizing features such as the inter-arrival time between apps and consistency in app usage.

We begin by introducing our dataset in Section 2, and in Section 3 we model the behavior of OTT customers as observed in this dataset. The churn prediction models ALSTM and NCPM are proposed in Section 4, followed by the results of our experiments in Section 5. Finally, we review related work in Section 6 and conclude our paper in Section 7.

2 DATASET

Our dataset consists of high-level application session data (start times and durations) and device-level events that span from Sept 2018 to April 2019. The data was collected from a sample of 31k AndroidTV-based OTT devices deployed in homes and operated by a large provider. Each of these devices comes pre-installed with a set of applications; apart from these, users (u) can download others from a large catalog on the app store. We observed over 3.7k distinct applications being used in the dataset. However, some devices and applications are used very sporadically and account for very little data. To remove these, we pre-processed the data to filter out devices that were active for less than 90 days in total, and removed applications that were used on fewer than 15 devices in our population. This filtering resulted in 14,082 unique users (i.e., OTT devices)[1] and 462 unique apps; this dataset is denoted by D. Note that the churn prediction model builds features over the lifetime of the application on the device; this starts from the time the application was installed, to when the user is deemed to have abandoned it. Unfortunately, for pre-installed applications such as Netflix, YouTube and SlingTV there is no install date. One can exclude such apps from the data; however, this leads to removing a significant number of users, because a large proportion of users tend to confine themselves to the pre-installed apps (which also happen to be the popular streaming services). Alternatively, having all users in a single bin could lead to serious bias in our modeling, since for default apps we do not have a clear signal on when they were downloaded; the user might have been using the app well before the onset of our study. Therefore, besides D, we create a separate dataset De that completely excludes the default apps, while for (u, a) ∈ D that do not have a start date, we simply take the first log entry of a by u as a proxy for the actual install date. The subset De covers 8,223 unique devices using 397 distinct applications.

[1] We use the words devices and users interchangeably.

Clearly, the earlier a service provider is able to predict churn (of u), the more effectively they can take action and address the underlying reasons why a customer might be departing. Consequently, we divide D and De into different days of activities A, where A = {t | t ≤ T, T ∈ {5, 10, 15, 20}}. For instance, T = 5 means we consider a maximum of five days of user activity to predict churn. Table 1 shows the characteristics of datasets D and De across different activity days. In the upcoming section, we explain the methodology for determining the churners and non-churners (i.e., columns four and five in Table 1). Here, one can see that as T increases, the number of users decreases. This is because the number of users that continuously use the OTT for, say, 20 days is far less than the number who use it for just 5 days.

3 ANALYZING USER CHARACTERISTICS

App and Device Usage: Figure 1 (a) shows the ten most popular apps seen in our dataset, based on the number of users that regularly use them. Sling TV, Netflix, YouTube and Google games occupy the top four spots. Figure 1 (b) plots the distribution of daily time spent inside each of the applications. We find that users spend 3-4 hours on average with the OTT device, with a very small fraction of users spending more than 8 hours. In fig. 1 (c), we break this daily spend into four different parts of a day, corresponding to morning (5am-12pm), afternoon (12pm-5pm), evening (5pm-10pm) and night (10pm-5am). Unsurprisingly, we observe that users tend to spend more time in the evening when compared to other periods of a
          Dataset D                              Dataset De
T    #Users   #Apps   #Churns   #NonChurns   #Users   #Apps   #Churns   #NonChurns
5    13082    402     9017      16044        6390     283     8712      9421
10   7954     347     6339      8596         5500     234     4661      7538
15   5440     256     4123      5932         4929     194     3088      6423
20   5009     179     3193      4067         4453     165     2231      5548

Table 1: Statistics of the churn prediction datasets D and De for different ranges of activity days T. For example, T = 5 implies that for a given user-app (u, a) tuple, u actively used a for up to 5 days before churning.
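The device and application filtering described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the log record shape (device, app, day) is a hypothetical simplification of the session data.

```python
# Sketch of the pre-processing: drop devices active on fewer than `min_days`
# distinct days, then drop apps seen on fewer than `min_devices` distinct devices.
from collections import defaultdict

def preprocess(logs, min_days=90, min_devices=15):
    # Count distinct active days per device and keep sufficiently active devices.
    days_per_device = defaultdict(set)
    for device, app, day in logs:
        days_per_device[device].add(day)
    kept_devices = {d for d, days in days_per_device.items() if len(days) >= min_days}
    logs = [r for r in logs if r[0] in kept_devices]

    # Count distinct devices per app and keep sufficiently widespread apps.
    devices_per_app = defaultdict(set)
    for device, app, day in logs:
        devices_per_app[app].add(device)
    kept_apps = {a for a, devs in devices_per_app.items() if len(devs) >= min_devices}
    return [r for r in logs if r[1] in kept_apps]

# Tiny synthetic example: d2 is active on too few days and gets filtered out.
logs = [("d1", "netflix", i) for i in range(100)] + [("d2", "netflix", i) for i in range(5)]
filtered = preprocess(logs, min_days=90, min_devices=1)
print({r[0] for r in filtered})  # {'d1'}
```

On the real data, the same two passes reduce 31k devices and 3.7k apps to the 14,082 users and 462 apps of dataset D.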
day (median of 2.2 hrs). However, it is not significantly higher than afternoon, which has a median of 1.8 hrs, and morning, with a median of 1.4 hrs. Another key statistic of interest is the inter-arrival time between application sessions. Note that there may not be an explicit indicator of a user quitting an application. Very often, users just stop using the application that is installed, or deactivate their accounts but keep the application installed. Thus, churn must be detected implicitly, i.e., by the fact of the application not being started by the user for a sufficiently long time. We calculate the arrival time between successive start times of an application session on a device (across all applications) and plot the maximum values, across all the devices, in fig. 1 (d). We observe that users, after an absence, return to the OTT applications within a median time of 6 days (100-200 hours). The 75%-ile value of this distribution is about 10 days. Later in this section, we leverage this to establish an inactivity threshold when we define churn more precisely.

User Engagement Patterns: Here we try to understand how users spend time on their OTT device. We wish to explore the following aspects: (a) Are there users who consistently use the box for the same number of hours every day? (b) Are there dormant users who don't use their device for a while, but then reactivate it? (c) Are there users that engage with their device, but only intermittently and for brief periods of time? To answer these, we carry out the following analysis. First, for each day that our dataset spans, we compute the cumulative time (in terms of the cumulative distribution function, CDF) that the user engaged with the device. Specifically, we compute (i, c_i) for each user, where i = 1, 2, .., |D| represents each day in our dataset, and c_i is the total number of hours spent on the device up to day i. Next, we carry out a non-linear fit on this data for each user, recording the learned slope, intercept and standard error as derived features for each user. Subsequently, we cluster the derived features using K-means; the number of clusters is determined based on the silhouette score [23]. This analysis yields four main behavior patterns that cover the vast majority of users, depicted in fig. 2. Each plot is based on the original data of cumulative device engagement time (x-axis is days elapsed, y-axis is cumulative time spent). These four patterns can be labeled as follows: (1) mid bloomers (a): these are users who are initially silent and do not use the OTT box heavily, but then suddenly start using it during the middle phase of their total period. (2) late bloomers (b): these users remain dormant for a longer duration with minimal activity, but suddenly start using the device towards the end. (3) potential churners (c): these are the users who are of most interest to us. As explained by the plot, these users start using the device heavily at first, but then stop using the box for various reasons. Please note that since the y-axis is the CDF, the flat line here indicates minimal or no activity (also indicated by the very low standard deviation, since there is no activity). (4) consistent users (d): finally, these are users who are consistent and regularly use their OTT boxes to watch different shows.

Understanding Churning Behavior: Before introducing our prediction models, we briefly explain how we label a user (or device) as having churned, i.e., left the service. This is fundamentally a difficult task because there is no explicit signal for this behavior. Further complicating things, (a) some users don't use the app for a few days, but return after a brief period of inactivity, and (b) some users simply download the app once (or spend a brief amount of time in it) and never use it again. Figure 3 depicts, at a high level, all the information for a user (u) and application (a). The dotted lines at either end capture the time data was collected, and each of the green vertical lines in the middle indicates the start of an application session (a_1 is the first session, a_n is the last). Here, we see that the application was downloaded after the start of the data collection and used several times, the last instance being at t_3. Somewhat infrequently, we see the device itself disappear from the dataset; we consider this a signal that the user has disconnected the device and is no longer using it. In this scenario, we capture this event as having occurred at t_4. With this depiction, we can now define churn in very specific terms by addressing the two challenges previously discussed. First, we require that the application not be used for a period of time after the last use. Following the example in fig. 3, we impose the condition Δ_2 ≥ T_3q. Here T_3q, an inactivity threshold, is the 3rd quartile of the distribution in fig. 1 (d) and turns out to be 10 days. Second, we require a minimum number of sessions to be recorded for an application and user. Specifically, u should have engaged with a at least K times, i.e., n ≥ K, and we set K = 3. Figure 4 illustrates the characteristics of devices that are exclusively labeled as churn. We see that the median inter-arrival times for the top 5 apps are around 30 days (Figure 4 (b)), which is significantly higher than the generic inter-arrival characteristics shown in Figure 1 (d). In figure 4 (c) we notice that users who download more apps tend to have a higher churning rate; we obtained a Pearson correlation coefficient of 0.67. It is also interesting to observe that as churn increases, users tend to switch between apps more frequently, where the app switch is indicated by the session feature (y-axis).
Figure 1: Generic characteristics of users in D: (a) the top 10 most frequently used apps, (b) the majority of users spend 2-6 hrs per day, (c) users tend to spend more time in the evening, and (d) the majority of users tend to return to their OTT device within a maximum of 8-9 days (≈ 200 hrs).
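The inter-arrival statistic behind Figure 1 (d) can be sketched as follows: per device, take the gaps between successive session start times and keep the maximum; the 3rd quartile of these maxima later serves as the inactivity threshold T_3q. This is an illustrative reconstruction with synthetic numbers, not the paper's code; the input shape (device to sorted start times, in hours) is an assumption.

```python
# Maximum inter-arrival time per device, then the 3rd quartile across devices.
import statistics

def max_inter_arrival(start_times):
    # Gaps between successive session start times; return the largest absence.
    gaps = [b - a for a, b in zip(start_times, start_times[1:])]
    return max(gaps)

sessions = {
    "d1": [0, 24, 240],   # one long 216h absence
    "d2": [0, 48, 96],    # regular 48h gaps
    "d3": [0, 120, 121],  # one 120h absence
}
maxima = sorted(max_inter_arrival(t) for t in sessions.values())
print(maxima)  # [48, 120, 216]

# The 3rd quartile of this distribution plays the role of T3q (≈10 days on real data).
q3 = statistics.quantiles(maxima, n=4)[2]
```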
Figure 2: The four types of users captured by our clustering framework. Clockwise from top left, (a) mid-bloomers, (b) late-
bloomers, (c) potential-churners and (d) consistent users. The error bars show the standard deviation based on usage.
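The per-user feature extraction feeding the clustering above can be sketched as below. Assumptions worth flagging: a plain least-squares line stands in for the paper's non-linear fit, and the function name and input shape are illustrative; on the real data the (slope, intercept, stderr) triples would then go to K-means with the silhouette score selecting the number of clusters (e.g., via scikit-learn).

```python
# For each user, fit a line to (day, cumulative hours) and keep the slope,
# intercept and residual standard error as derived features for clustering.
def fit_features(cumulative_hours):
    xs = list(range(1, len(cumulative_hours) + 1))
    n = len(xs)
    mx, my = sum(xs) / n, sum(cumulative_hours) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, cumulative_hours))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, cumulative_hours)]
    stderr = (sum(r * r for r in residuals) / (n - 2)) ** 0.5 if n > 2 else 0.0
    return slope, intercept, stderr

# A "consistent user" accumulates ~2 hrs/day: slope ~2 with near-zero residual
# error. A "potential churner" curve would flatten, giving a large stderr.
slope, intercept, stderr = fit_features([2, 4, 6, 8, 10])
print(round(slope, 2))  # 2.0
```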
4 PREDICTING CHURN

Given a user u and an app a, our objective is to predict if u will continue or stop using a. We realize each user-app entity as a tuple (X, M_x, Y) where X = {x_1, ..., x_t} is a stream of events (or logs) that spans a time t ∈ T. Each event x comprises M features, and Y = {y_1, ..., y_t} are the binary labels that indicate churn (or non-churn) at t. When designing our churn prediction model we had two main objectives. First, since our data is highly temporal, it is important to learn the latent characteristics of churners (and non-churners) in such a way that it embeds the temporality of events. Second, not only should we predict churn with good accuracy, but we should also produce highly interpretable results. In other words, we should be able to reason about why a user is churning. To achieve this, we propose the following models: (1) Attention LSTM (ALSTM), which is a simple modification of the neural machine translation (NMT) model [2], and (2) Neural Churn Prediction Model (NCPM), a more comprehensive model that incorporates temporal-level attention (or local attention) and feature-level attention (or global attention). Both models are based on recurrent neural networks (RNNs), which have been shown to be effective in modeling time-series data [5, 8]. RNNs take a series of temporally dependent inputs and learn their
Figure 3: High level summary of user u interacting with application a
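The churn-labeling rule defined in Section 3 (Δ_2 ≥ T_3q and n ≥ K, with T_3q = 10 days and K = 3) can be expressed directly. This is a hedged sketch; the function name and input shape are illustrative, not taken from the paper's code.

```python
# Label a (u, a) pair as churned when (i) at least K sessions were observed
# and (ii) the gap between the last session and the end of data collection is
# at least the inactivity threshold T3q. Times are in days.
def is_churned(session_starts, collection_end, t3q_days=10, k=3):
    if len(session_starts) < k:        # too few sessions to decide (n >= K)
        return False
    last_use = max(session_starts)
    return collection_end - last_use >= t3q_days  # Δ2 >= T3q

print(is_churned([1, 5, 9], collection_end=30))   # True: 21 days idle
print(is_churned([1, 5, 25], collection_end=30))  # False: only 5 days idle
print(is_churned([1, 25], collection_end=60))     # False: fewer than K sessions
```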
Figure 4: Characteristics of churning OTT devices: (a) most frequently churned applications for the dataset De, (b) the median inter-arrival time is about 25-30 days for the top-5 churning apps, (c) users tend to churn more when they download more apps, and (d) users who churn on more apps tend to switch between apps more frequently.
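The 0.67 coefficient reported for Figure 4 (c) is the standard sample Pearson correlation. A self-contained version with synthetic numbers (not the paper's data):

```python
# Sample Pearson correlation between apps downloaded and churn rate.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

downloads = [1, 2, 3, 4, 5, 6]                    # synthetic example
churn_rate = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30]
r = pearson(downloads, churn_rate)                # strongly positive, near 1
```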
latent representation (or hidden state vector) using the following expression:

    h_t = f(h_{t-1}, x_t)    (1)

where h_t is the hidden state at time t and f is some non-linear function. For our application, we model f using a long short-term memory network (LSTM) [13]. The LSTM has four states that are defined as follows:

    i_t = σ(W_i · [h_{t-1}; x_t] + b_i)
    f_t = σ(W_f · [h_{t-1}; x_t] + b_f)
    c_t = f_t × c_{t-1} + i_t × tanh(W_c · [h_{t-1}; x_t] + b_c)    (2)
    o_t = σ(W_o · [h_{t-1}; x_t] + b_o)
    h_t = o_t × tanh(c_t)

where t is the time step (i.e., days), h_t is the hidden state at t, c_t is the cell state at t, x_t is the hidden state of the previous layer at time t, and i_t, f_t, o_t are the input, forget and output gates, respectively.

Attention LSTM (ALSTM): Obviously, one can predict churn by simply providing the input X to a vanilla LSTM, taking the latent vectors h from the final layer, and using them as features for prediction. A potential issue with this approach is that the compressed (or low-dimensional) latent vector h is inefficient at capturing all the necessary information that contributes to churn. As explained in Section 1, in a particular week we might observe the subscriber slowly starting to navigate towards a new application and spending less and less time in one that they were previously engaged with (and eventually abandon). So, it is important to give high priority to these time windows when compared to other weeks. Modeling churn using vanilla LSTM networks fails to prioritize such key events. Inspired by recent developments in neural machine translation (NMT) [2], we incorporate attention into the LSTM to overcome this issue. Since our application is very different from natural language processing, we introduce two modifications over NMT. First, we replace the decoder part with a single-layer neural network (NN) with sigmoid activation for churn prediction, and second, we change
Figure 5: The end-to-end architecture of the proposed models: (a) ALSTM, which uses attention on a single LSTM network to model churn, and (b) NCPM, which uses separate LSTM networks for modeling individual features and predicts churn using weighted attention.
the attention mechanism to suit our problem. The proposed ALSTM model is shown in Figure 5 (a). Here, the attention block A outputs a vector of weights α that emphasizes the importance of the latent vector h for a given time frame t. The weighted latent vector p is defined as follows:

    p = Σ_{t=1}^{T} α_t h_t    (3)

where the weight α_j for a time instance t is defined by

    α_j = exp(s_j) / Σ_{t=1}^{T} exp(s_t),  with  s_j = Σ_{k=1}^{K} h_t^k · W_{tj}^k    (4)

Neural Churn Prediction Model (NCPM): One drawback of ALSTM is that it is unable to prioritize across features. For example, churning could be more influenced by the consistency of users (see Figure 2), while the number of downloads might have little importance. To overcome this problem, we incorporate both temporal-level attention and feature-level attention. As depicted in Figure 5, instead of treating the features as a single vector, we decouple the features and model them using individual ALSTMs. Similar to ALSTM, the attention block p^m of a feature m captures the influence (or weight) of the latent features from different slices of time. On the other hand, the feature-level attention is captured by the block B, which is defined by the following expressions:

    g = Σ_{m=1}^{M} β_m p^m    (5)

In the above expression, β denotes the individual attention weights, defined as follows:

    β_m = exp(c^m) / Σ_{m=1}^{M} exp(c^m)    (6)

    c_j^m = Σ_{i=1}^{Z} z_i U_{ij}    (7)

where z = p^1 ⊕ {p^i}_{2}^{m} is the concatenation (indicated by ⊕) of the feature-level latent vectors. Finally, to predict churn, a linear projection with a sigmoid function is connected to the output of the last layer to produce the user churn prediction as follows:

    ŷ = σ(W_g · g + b_g)    (8)

The loss for both ALSTM and NCPM is computed using binary cross-entropy, defined as follows:

    L = Σ_i [ −y_i log(ŷ_i) − (1 − y_i) log(1 − ŷ_i) ]    (9)

5 EXPERIMENTS

Obviously, from Table 1, one can notice that our dataset is biased towards negative samples (i.e., #non-churns). Therefore, to create a balanced dataset, for every positive data point (i.e., churn) for an app a, we randomly sample a corresponding negative data point. We test our models by varying the number of days in the training sample (explained in Section 2). This helps us see how quickly our models can predict churn. For all our experiments, we use 10-fold cross validation, where eight folds are used for training, one for validation, and one for testing. The deep learning models are implemented using Keras with TensorFlow as the back-end.

5.1 Baselines

We compare the performance of the proposed ALSTM and NCPM with three baseline methods. Unlike the proposed models (i.e., ALSTM and NCPM), the following baselines do not capture the temporality in the data. Therefore, the inputs to these models are flattened vectors across the time frames.
Logistic Regression: the classic model for binary classification problems. Albeit simplistic, it helps us understand whether a linear decision boundary is sufficient to separate the churners from the non-churners. We use the L2 norm as the regularizer and Stochastic Average Gradient (SAG) as the solver.

Multi-layer Perceptron (MLP): We consider a simple two-layer neural network with a dropout layer to avoid overfitting. A linear projection with a sigmoid function is connected to the output of the last layer to produce the user churn prediction. Similar to logistic regression, MLP does not capture the temporal dependencies in the data. The number of neurons is set to 80 for each intermediate layer.

Random Forest (RF): Despite the rapid advancements in the field of deep learning, ensemble techniques such as RF [4] remain highly competitive in producing excellent results on data with several modalities. In our experiments, the number of decision trees is set to 50 and the maximum depth to 10.

5.2 Results

Classification accuracy: Tables 2-3 show that NCPM outperforms all other models for both datasets De and D, achieving an accuracy of up to 92%. As we increase the number of days, the accuracy increases for all models (except logistic regression). Here, CA-5 implies the classification accuracy with just 5 days of data, while CA-20 implies 20 days of data. We can also see that the proposed ALSTM is not as good as NCPM, which shows the following: (1) it is important to learn the latent attributes of each individual feature separately, and (2) incorporating both global and local attention is necessary. That being said, ALSTM clearly outperforms MLP, which emphasizes the necessity of learning the temporal actions of OTT users. The worst performing model is logistic regression, which is just slightly better than random selection. This illustrates the difficulty of our churn prediction task. The performance of RF is very close to that of ALSTM, which indicates that ensemble models are still a strong candidate for our problem.

In general, the performance of the models over non-continuous data is much better than over the continuous counterpart; this can be explained using the following example. Let us say that u uses an app for six days before churning and we have the following data for u: {m_1, m_4, m_6, m_7, m_8, m_10}, where m is some feature and the suffix indicates the day. Our objective is to predict the outcome on the sixth day, using the first five days. Since the user does not use the OTT box on days two, three, and five, the continuous data that is fed to our models (i.e., both NCPM and ALSTM) is essentially a sparse vector {m_1, 0, 0, m_4, 0}, which has several missing values. On the contrary, for the non-continuous dataset, we will have the actual usage values for five days. This obviously means that the model gets to train with more observed data points, which leads to better prediction accuracy. Another interesting observation is that the performance of all models (except logistic regression) is noticeably better on the all-apps dataset. One key reason for this outcome is the popularity

corresponding AUC values are listed in Table 4; due to space constraints, only the non-default case is furnished. Similar to the accuracy scores, for most scenarios NCPM remains dominant over the other models. We also notice that RF tends to perform better than NCPM and ALSTM when the temporal length of the data is low (i.e., with just five days). However, as we incorporate more days for training, there is a clear increase in the performance of our models. The outcome for dataset D is much different from De, where we are able to achieve an AUC of almost 89% with just five days of data; additionally, ALSTM seems to perform very similarly to NCPM.

Interpreting the churn prediction: One of the key strengths of our model is interpretability. As explained in Section 1, ALSTM provides a single level of interpretability, indicating which days are important when predicting churn. NCPM, on the other hand, has two levels; besides indicating the important days, it also tells us which features are important. We present the interpretability scores as heatmaps in Figure 8. Due to the lack of space, we only furnish the results of the non-continuous dataset. Heat maps (a)-(d) show that the influence of features is not uniform across apps; for instance, when we have fewer days for prediction, churn is influenced by two main attributes, namely the number of downloads and the cluster types (Figures 8 (a) and (c)). As we incorporate more data for training (i.e., the number of days), the attention tends to get more focused towards a few key features. For non-default apps, there seems to be more attention on the inter-arrival time, while for all-apps the influence leans more towards the number of reboots. It is possible that these apps experience a higher number of app crashes, which could lead to the user rebooting the device. For non-default apps, Hulu, Plot Tv and Kodi are heavily influenced by the cluster id feature that we engineered in Section 3. When it comes to temporal attention (Figures 8 (e)-(h)), for dataset D the influence is mainly concentrated on a few selected days, i.e., day 4 for the non-continuous case and day 2 for the continuous case. Contrary to this, for De this influence is spread across almost all days.

In Section 3 we explained that the engagement pattern of users could have a strong impact on churn. To show this effect, for each user, we get the final attention score from the individual RNNs of NCPM and plot the outcome in Figure 9 (a). Here, we can see that consistent users are the strongest indicators of non-churn, while potential churners are the strongest indicators of churn. Interestingly, mid bloomers seem to have higher attention than late bloomers when it comes to predicting non-churners, while the opposite is true for churners. Figure 9 (b) provides a deeper look into this outcome by emphasizing the importance of temporal progression on the user types. Unsurprisingly, during the initial phase (elapsed duration of 10-20%) almost all types have low attention weights. This is because, during the early phase, we do not have enough data about the user type. As time progresses, around 20-50% of the elapsed duration, we see that potential churners have the strongest impact on the outcome, followed by consistent users and
of the default apps. Apps such as Sling TV, Netflix and Youtube late bloomers. Around 50-80% , the impact of potential churners
are significantly popular than other non-default apps. Therefore, drastically reduces, while mid and late bloomers increase. At the
the models are able to effectively learn the churn patterns for such final stage (i.e., 80-100%) almost all user types have less importance
apps more effectively. (or attention). This is because, during the last phase, there is more
AUC and ROC characteristics: Figures 6 and 7 compare the ROC available data in the form of other features such as number of
characteristics of the proposed models with other baselines. The
ORSUM@ACM RecSys 2020, September 25th, 2020, Virtual Event, Brazil Vineeth Rakesh, et al.
(a) Non-default apps, non-continuous data
Model     CA-5   CA-10  CA-15  CA-20
Logistic  0.56   0.55   0.54   0.54
RF        0.65   0.73   0.79   0.80
MLP       0.60   0.66   0.72   0.78
ALSTM     0.65   0.77   0.83   0.86
NCPM      0.67   0.79   0.84   0.88

(b) Non-default apps, continuous data
Model     CA-5   CA-10  CA-15  CA-20
Logistic  0.56   0.55   0.54   0.54
RF        0.66   0.70   0.73   0.75
MLP       0.58   0.63   0.70   0.74
ALSTM     0.60   0.67   0.74   0.79
NCPM      0.62   0.70   0.76   0.81

Table 2: Classification accuracy (CA) for (a) non-default and non-continuous data across 5-20 days and (b) non-default and continuous data across 5-20 days.
(a) All apps, non-continuous data
Model     CA-5   CA-10  CA-15  CA-20
Logistic  0.56   0.55   0.54   0.54
RF        0.73   0.78   0.85   0.89
MLP       0.70   0.72   0.73   0.74
ALSTM     0.78   0.83   0.87   0.91
NCPM      0.78   0.84   0.89   0.92

(b) All apps, continuous data
Model     CA-5   CA-10  CA-15  CA-20
Logistic  0.56   0.55   0.54   0.54
RF        0.72   0.78   0.78   0.83
MLP       0.62   0.70   0.71   0.74
ALSTM     0.74   0.79   0.83   0.86
NCPM      0.76   0.82   0.84   0.89

Table 3: Classification accuracy (CA) for (a) all apps and non-continuous data across 5-20 days and (b) all apps and continuous data across 5-20 days.
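The gap between the non-continuous and continuous results in Tables 2 and 3 comes from how missing days are encoded. A minimal sketch of the two input constructions, using hypothetical feature values (the function and its inputs are illustrations, not the paper's actual pipeline):

```python
# Sketch of the continuous vs. non-continuous input construction described
# in the text. Feature values are hypothetical stand-ins for some per-day
# usage feature m.

def build_inputs(daily_usage, window):
    """daily_usage maps day -> feature value for days the user was active.
    Returns (continuous, non_continuous) input sequences of length `window`."""
    # Continuous: one slot per calendar day; inactive days become zeros.
    continuous = [daily_usage.get(day, 0.0) for day in range(1, window + 1)]
    # Non-continuous: only the days with observed activity, in order.
    active_days = sorted(daily_usage)
    non_continuous = [daily_usage[d] for d in active_days[:window]]
    return continuous, non_continuous

# A user active on days 1, 4, 6, 7, 8 and 10 (cf. {m1, m4, m6, m7, m8, m10}):
usage = {1: 2.5, 4: 1.0, 6: 0.5, 7: 3.0, 8: 1.5, 10: 0.25}
cont, non_cont = build_inputs(usage, window=5)
print(cont)      # [2.5, 0.0, 0.0, 1.0, 0.0] -- sparse, cf. {m1, 0, 0, m4, 0}
print(non_cont)  # [2.5, 1.0, 0.5, 3.0, 1.5] -- first five observed days
```

The continuous sequence is mostly zeros for an intermittent user, which is why models trained on it see fewer informative data points.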
[Figure 6: Receiver operating characteristic (ROC) curves of the proposed ALSTM and NCPM models for non-default apps. Panels (a)-(d) show the non-continuous dataset at 5, 10, 15, and 20 days; panels (e)-(h) show the continuous dataset at the same intervals.]
downloads and inter-arrival time between apps. Consequently, the model is able to rely on better indicators at the later stage.

6 RELATED WORK
The problem tackled in this paper is related to the following topics: (1) churn prediction, (2) user behavior modeling, and (3) interpretable neural networks. We now detail some existing research corresponding to these topics.
Churn Prediction: User retention (or churn) has been extensively studied in the field of social computing and human computer interaction (HCI) [9, 14, 27]. However, developing predictive models for churn is still in its infancy. Au et al. [1] adopt a rule-based learning technique for early churn prediction. In [29], the authors tackle the problem of churn prediction in mobile apps. They find that application performance factors such as energy consumption and latency have a significant impact on retention. [16] use a social influence based approach for churn prediction. Recently, [26] developed an interpretable framework that constrains the objective of an RNN with the outcome of k-means clustering to predict the retention of users in Snapchat.
Modeling user behavior: There is a large body of research on behavior modeling [10, 15, 21]. For example, [6] predict user intents
(a) Non-default apps, non-continuous data
Model     AUC-5  AUC-10  AUC-15  AUC-20
Logistic  0.56   0.55    0.55    0.55
RF        0.71   0.81    0.87    0.89
MLP       0.65   0.72    0.79    0.84
ALSTM     0.66   0.76    0.85    0.89
NCPM      0.72   0.85    0.90    0.91

(b) Non-default apps, continuous data
Model     AUC-5  AUC-10  AUC-15  AUC-20
Logistic  0.55   0.53    0.53    0.54
RF        0.71   0.77    0.83    0.86
MLP       0.62   0.69    0.77    0.83
ALSTM     0.60   0.66    0.71    0.78
NCPM      0.66   0.76    0.84    0.88

Table 4: Area under the ROC curve (AUC) for (a) non-default apps and non-continuous data across 5-20 days and (b) non-default apps and continuous data across 5-20 days.
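The AUC values in Table 4 summarize each ROC curve as a single number. As a reminder of what the metric measures, here is a minimal sketch of AUC via its rank-based (Mann-Whitney) formulation; the labels and scores are toy values, not the paper's data:

```python
# Minimal sketch of the AUC metric: the probability that a randomly chosen
# positive (churner) is scored above a randomly chosen negative, ties = 1/2.

def auc(labels, scores):
    """labels: 1 = churn, 0 = non-churn; scores: predicted churn probability."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 0, 1, 0]
p = [0.9, 0.5, 0.4, 0.6, 0.8, 0.1]
print(round(auc(y, p), 3))  # 0.889: 8 of 9 positive/negative pairs ranked correctly
```

A model no better than random selection, like the logistic regression baseline here, hovers near 0.5 on this scale.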
[Figure 7: Receiver operating characteristic (ROC) curves of the proposed ALSTM and NCPM models for all apps. Panels (a)-(d) show the non-continuous dataset at 5, 10, 15, and 20 days; panels (e)-(h) show the continuous dataset at the same intervals.]
[Figure 8: The feature-level (a-d) and the temporal-level (e-h) influences on churn prediction for the non-continuous dataset, shown for datasets De and D at 10 and 20 days. The color gradient denotes the probability scores, where red denotes the highest weight and green the lowest influence.]
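The influences visualized in Figure 8 are attention weights: a softmax over per-step scores that sums to one and highlights the most influential features or days. A minimal sketch of temporal-level attention over per-day hidden states; the scoring vector, dimensions, and random states are stand-ins, not NCPM's learned parameters:

```python
# Sketch of temporal-level attention: score each day's hidden state, softmax
# the scores into weights, and form an attention-weighted summary vector.
# Hidden states and the scoring vector are random stand-ins.
import math
import random

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(hidden_states, score_vec):
    """hidden_states: one vector per day; score_vec: learned scoring weights.
    Returns (attention weights over days, attended context vector)."""
    scores = [sum(w * h for w, h in zip(score_vec, state))
              for state in hidden_states]
    alphas = softmax(scores)                       # one weight per day, sums to 1
    dim = len(hidden_states[0])
    context = [sum(a * state[i] for a, state in zip(alphas, hidden_states))
               for i in range(dim)]                # weighted sum of daily states
    return alphas, context

random.seed(0)
days, dim = 5, 4
H = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(days)]
v = [random.uniform(-1, 1) for _ in range(dim)]
alphas, ctx = temporal_attention(H, v)
print([round(a, 3) for a in alphas])  # the largest weight marks the most influential day
```

Plotting such weights per feature (feature-level) or per day (temporal-level) yields heatmaps of the kind shown in Figure 8.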
by leveraging the activity logs in Pinterest. [12] predict the likelihood of a successful search in web search queries. They show that user behavior is more predictive of goal success than document relevance. [22] model the behavior of users in the Kickstarter crowdfunding domain using a heterogeneous combination of social communities, popularity of projects and the impact of reward categories. Studies such as [11, 24] and [21] model user behavior from sequential actions such as click streams and social
[Figure 9: The attention weights of mid bloomers (MB), late bloomers (LB), potential churners (PC), and consistent users (CU): (a) the overall attention scores during the final phase of prediction; (b) the attention scores at different stages of temporal progression.]
network activities. [24] use a combination of Mahalanobis distance (for detecting outliers) and Markov chains to model sessions in click streams, while [21] use a temporal LDA based approach for tour recommendation in Foursquare.
Interpretable Sequence Modeling: RNNs have become the state-of-the-art technique for sequential modeling [5, 13]. Despite a plethora of research in the NLP domain [2, 19], extending interpretable RNNs to other real-world applications is still an emerging field of research. In a recent work, [18] predict the engagement of users in the Snapchat app by capturing the in-app action transition patterns as a temporally evolving action graph. [17] develop an interpretable LSTM to learn multi-level graph structures in a progressive and stochastic manner. [20] propose a dual-stage attention model for medical diagnostics such as heart failure prediction. Zhou et al. [28] propose an attention-based RNN that predicts the purchase probability of users for targeted ads. Albeit having a similar NN architecture to ours, their problem is quite different from churn prediction. Additionally, their modeling of local and global attention is quite different from ours. To the best of our knowledge, the only research that closely resembles our work is the churn prediction model proposed by Yang et al. [26]. However, the attention mechanism used in their work is quite different from ours.

7 CONCLUSION
In this paper we proposed interpretable recurrent neural network based models for predicting churn in over-the-top (OTT) media devices. In the first part of the paper, we analyzed the behavioral characteristics of users and found that they can be categorized into four main types: mid bloomers, late bloomers, potential churners and consistent users. In the second part, we introduced two models for churn prediction, namely the Attention LSTM (ALSTM) and the Neural Churn Prediction Model (NCPM). In ALSTM, churn is predicted by weighting individual time frames (temporal-level attention), while in NCPM we use two levels of attention, namely feature-level and temporal-level. We showed that NCPM outperforms all other models over a wide range of test cases and achieves an accuracy of up to 89% and an AUC of 92%.

REFERENCES
[1] Wai-Ho Au, Keith CC Chan, and Xin Yao. 2003. A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation 7, 6 (2003), 532–545.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157–166.
[4] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[5] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. 2018. Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8, 1 (2018), 6085.
[6] Justin Cheng, Caroline Lo, and Jure Leskovec. 2017. Predicting intent using activity logs: How goal specificity and temporal range affect user behavior. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 593–601.
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[8] Edward Choi, Andy Schuetz, Walter F Stewart, and Jimeng Sun. 2016. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association 24, 2 (2016), 361–370.
[9] Giovanni Luca Ciampaglia and Dario Taraborelli. 2015. MoodBar: Increasing new user retention in Wikipedia through lightweight socialization. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 734–742.
[10] Gideon Dror, Dan Pelleg, Oleg Rokhlenko, and Idan Szpektor. 2012. Churn prediction in new users of Yahoo! answers. In Proceedings of the 21st International Conference on World Wide Web. ACM, 829–834.
[11] Şule Gündüz and M Tamer Özsu. 2003. A web page prediction model based on click-stream tree representation of user behavior. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 535–540.
[12] Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. 2010. Beyond DCG: user behavior as a predictor of a successful search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 221–230.
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Selim Ickin, Katarzyna Wac, Markus Fiedler, Lucjan Janowski, Jin-Hyuk Hong, and Anind K Dey. 2012. Factors influencing quality of experience of commonly used mobile applications. IEEE Communications Magazine 50, 4 (2012), 48–56.
[15] Marcel Karnstedt, Matthew Rowe, Jeffrey Chan, Harith Alani, and Conor Hayes. 2011. The effect of user features on churn in social networks. In Proceedings of the 3rd International Web Science Conference. ACM, 23.
[16] Jaya Kawale, Aditya Pal, and Jaideep Srivastava. 2009. Churn prediction in MMORPGs: A social influence based approach. In 2009 International Conference on Computational Science and Engineering, Vol. 4. IEEE, 423–428.
[17] Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, and Eric P Xing. 2017. Interpretable structure-evolving LSTM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1010–1019.
[18] Yozen Liu, Xiaolin Shi, Lucas Pierce, and Xiang Ren. 2019. Characterizing and Forecasting User Engagement with In-app Action Graph: A Case Study of Snapchat. arXiv preprint arXiv:1906.00355 (2019).
[19] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
[20] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 (2017).
[21] Vineeth Rakesh, Niranjan Jadhav, Alexander Kotov, and Chandan K Reddy. 2017. Probabilistic social sequential model for tour recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 631–640.
[22] Vineeth Rakesh, Wang-Chien Lee, and Chandan K Reddy. 2016. Probabilistic group recommendation model for crowdfunding domains. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 257–266.
[23] Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.
[24] Narayanan Sadagopan and Jie Li. 2008. Characterizing typical and atypical user sessions in clickstreams. In Proceedings of the 17th International Conference on World Wide Web. ACM, 885–894.
[25] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[26] Carl Yang, Xiaolin Shi, Luo Jie, and Jiawei Han. 2018. I Know You'll Be Back: Interpretable New User Clustering and Churn Prediction on a Mobile Social Application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 914–922.
[27] Igor Zakhlebin, Em Horvát, et al. 2019. Investor Retention in Equity Crowdfunding. In Proceedings of the 10th ACM Conference on Web Science. ACM, 343–351.
[28] Yichao Zhou, Shaunak Mishra, Jelena Gligorijevic, Tarun Bhatia, and Narayan Bhamidipati. 2019. Understanding Consumer Journey using Attention based Recurrent Neural Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 3102–3111.
[29] Agustin Zuniga, Huber Flores, Eemil Lagerspetz, Petteri Nurmi, Sasu Tarkoma, Pan Hui, and Jukka Manner. 2019. Tortoise or Hare? Quantifying the Effects of Performance on Mobile App Retention. In The World Wide Web Conference. ACM, 2517–2528.