User Modeling and Churn Prediction in Over-the-top Media Services

Vineeth Rakesh, Interdigital AI Lab, USA, vineeth.mohan@interdigital.com
Ajith Pudiyavitil*, Lowe's, USA, ajithkp12@gmail.com
Jaideep Chandrashekar, Interdigital AI Lab, USA, jaideep.chandrashekar@interdigital.com

* This work was done when the author was at Interdigital AI Lab.

ABSTRACT
We address the problem of customer retention (churn) in applications installed on over-the-top (OTT) streaming devices. In the first part of our work, we analyze various behavioral characteristics of users that drive application usage. By examining a variety of statistical measures, we answer the following questions: (1) how do users allocate time across various applications? (2) how consistently do users engage with their devices? and (3) how likely are dormant users to become active again? In the second part, we leverage these insights to design interpretable churn prediction models that learn the latent characteristics of users by prioritizing the most informative aspects of their behavior. Specifically, we propose the following models: (1) Attention LSTM (ALSTM), where churn prediction uses a single level of attention that weights individual time frames (temporal-level attention), and (2) the Neural Churn Prediction Model (NCPM), a more comprehensive model that uses two levels of attention, one measuring the temporality of each feature and another measuring the influence across features (feature-level attention). Using a series of experiments, we show that our models provide good churn prediction accuracy with interpretable reasoning. We believe that the data analysis, feature engineering, and modeling techniques presented in this work can help organizations better understand the reasons behind user churn on OTT devices.

Reference Format:
Vineeth Rakesh, Ajith Pudiyavitil, and Jaideep Chandrashekar. 2020. User Modeling and Churn Prediction in Over-the-top Media Services. In 3rd Workshop on Online Recommender Systems and User Modeling (ORSUM 2020), in conjunction with the 14th ACM Conference on Recommender Systems, September 25th, 2020, Virtual Event, Brazil.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
In recent years, users have increasingly taken to consuming streaming video services via applications (e.g., Netflix, Hulu, YouTube) on so-called over-the-top (OTT) platforms (e.g., AppleTV, Roku, Amazon FireTV). Given the very large (and still growing) number of streaming services, there is fierce competition to attract new customers while maintaining customer satisfaction. Unfortunately, there is a significant cost to attracting new users; thus, service providers are heavily invested in retaining end users and keeping them engaged with their products. These customer retention efforts focus on providing exclusive and engaging content, personalized recommendations, and intuitive user interfaces. When such efforts fail, operators experience customer churn, wherein a subscriber stops using the service. By examining high-level data collected on one such OTT hardware platform, we propose feature engineering techniques for modeling user behavior and leverage these features to develop application-level churn prediction models. Specifically, given a user u who installs an application a at a given time on their OTT device, our model predicts whether u will be engaged (or not engaged) with a after a particular time window.

Users may decide to abandon a streaming service for any number of reasons, such as a limited time budget to consume content, an increasing affinity for a different application, or a lack of compelling new content. Yet another reason may be that the user experiences more hardware faults (e.g., reboots, poor WiFi, high memory usage) when a particular service is being used; in such cases, the user perceives these faults as being caused by the application and quits out of frustration. The key observation here is that both application-level and device-level behavior can influence user churn. Consequently, it is critical to model these heterogeneous factors, along with temporally correlated features reflecting usage of different applications, to accurately predict churn. To achieve this, we first analyze a dataset that captures high-level events on OTT devices and examine the signals of application churn. These events can be user-initiated (e.g., opening or closing an application, restarting the device, putting the device to sleep) or device-specific (e.g., automatic reboots, WiFi drops, software/firmware updates). With this data, we examine questions such as: (1) how users allocate time across applications on their OTT device, (2) how often and for how long users engage with the device (and with specific applications), and (3) how long users go dormant, and how likely a dormant user is to become active again. In the second part of our work, we leverage these statistical insights to design interpretable models that are effective in predicting churn across a wide range of scenarios.
The naive approach to building a churn prediction model would be to fix an observation time window T, extract a number of features of interest |m| from this window, and deploy a suitable classification algorithm that predicts whether a subscriber will quit a service after a period of time T. While this is entirely viable, it has two important drawbacks. First, the data is inherently noisy and high-dimensional: OTT devices send out periodic device-level and application-level summaries (the start time and duration of each application session) along with events observed on the box, which results in a feature space of size T × |m|. Second, there is significant inherent temporal correlation in the data: if a user spends a significant amount of time inside an application on successive days, that is a strong signal that he/she will engage with the same application on the next day. Flattening the data over the entire window T into a single representation vector loses this information. An alternative, more principled approach is to learn latent attributes of the data using time-series models such as recurrent neural networks (RNNs) [3] and use them as features for churn prediction. A potential issue with this approach is that the compressed latent vector is inefficient at capturing all the information that leads to churn. Furthermore, it is extremely difficult to interpret the results of vanilla RNNs.
In this paper, we propose models that address the drawbacks of these more conventional approaches. First, we introduce Attention LSTM (ALSTM), in which we modify the neural machine translation (NMT) model [2] for churn prediction; ALSTM models local attention. Second, we propose the Neural Churn Prediction Model (NCPM), which incorporates two levels of attention, i.e., local and global. ALSTM uses temporal-level attention (local attention), where different (sub-)observation windows contribute different weights towards predicting churn. For example, in a particular week we might observe a subscriber slowly starting to watch more and more content on a new application while spending less and less time in another in which they were previously engaged (and which they eventually abandon); features collected in that week should therefore receive higher priority than those from other time frames. NCPM, on the other hand, is a more comprehensive model that captures each feature using a separate ALSTM. The individual ALSTMs are then combined through a feature-level attention layer (global attention). Global attention is much better at prioritizing weights across different temporal features; for example, churn could be influenced more by device-level issues such as periodic reboots than by application engagement. Although attention-based RNNs have been used extensively in natural language processing [7, 25], they have rarely been applied to churn prediction. To our knowledge, the only work that appears to address this area is [26]; however, the attention mechanism used in that work is quite different from ours.
We summarize the major contributions of our work as follows:
• Understanding user behavior: Through data engineering and statistical analysis, we provide several insights that explain the behavior of users in our OTT dataset.
• Predicting churn: We propose attention-based RNN models that learn the characteristics of users in a weighted low-dimensional latent space.
• State-of-the-art performance: By conducting extensive experiments on a real-world dataset, we show that NCPM outperforms all other models over different test cases and achieves an accuracy of up to 89% and an AUC of 92%. Additionally, NCPM interprets the reason for churning by emphasizing features such as the inter-arrival time between apps and the consistency of app usage.

We begin by introducing our dataset in Section 2, and in Section 3 we model the behavior of OTT customers as observed in this dataset. The churn prediction models ALSTM and NCPM are proposed in Section 4, followed by the results of our experiments in Section 5. Finally, we review related work in Section 6 and conclude in Section 7.

2 DATASET
Our dataset consists of high-level application session data (start times and durations) and device-level events spanning September 2018 to April 2019. The data was collected from a sample of 31k AndroidTV-based OTT devices deployed in homes and operated by a large provider. Each of these devices comes pre-installed with a set of applications; apart from these, users (u) can download others from a large catalog on the app store. We observed over 3.7k distinct applications in use in the dataset. However, some devices and applications are used very sporadically and account for very little data. To remove these, we pre-processed the data to filter out devices that were active for fewer than 90 days in total and applications that were used on fewer than 15 devices in our population. This filtering resulted in 14,082 unique users (i.e., OTT devices)¹ and 462 unique apps; we denote this dataset by D. Note that the churn prediction model builds features over the lifetime of an application on a device, starting from the time the application was installed to the time the user is deemed to have abandoned it. Unfortunately, pre-installed applications such as Netflix, YouTube, and SlingTV have no install date. One could exclude such apps from the data, but this would remove a significant number of users, because a large proportion of users confine themselves to the pre-installed apps (which also happen to be the popular streaming services). At the same time, keeping all users in a single bin could seriously bias our modeling: for default apps we have no clear signal of when usage actually began, since the user might have been using the app well before the onset of our study. Therefore, besides D, we create a separate dataset De that completely excludes the default apps; for pairs (u, a) ∈ D that lack a start date, we simply take the first log entry of a by u as a proxy for the actual install date. The subset De covers 8,223 unique devices using 397 distinct applications.

¹ We use the words devices and users interchangeably.

Clearly, the earlier a service provider can predict the churn of u, the more effectively it can take action and address the underlying reasons why the customer might be departing. Consequently, we divide D and De into different ranges of activity days A, where A = {t | t ≤ T, T ∈ {5, 10, 15, 20}}. For instance, T = 5 means that we consider a maximum of five days of user activity to predict churn. Table 1 shows the characteristics of D and De across the different activity-day ranges; the next section explains how churners and non-churners (columns four and five in Table 1) are determined. Note that as T increases, the number of users decreases, because far fewer users continuously use the OTT device for, say, 20 days than for just 5 days.

Table 1: Statistics of the churn prediction datasets D and De for different ranges of activity days T. For example, T = 5 implies that, for a given user-app tuple (u, a), u actively used a for up to 5 days before churning.

            Dataset D                          Dataset De
T    #Users  #Apps  #Churns  #NonChurns   #Users  #Apps  #Churns  #NonChurns
5    13082   402    9017     16044        6390    283    8712     9421
10    7954   347    6339      8596        5500    234    4661     7538
15    5440   256    4123      5932        4929    194    3088     6423
20    5009   179    3193      4067        4453    165    2231     5548

3 ANALYZING USER CHARACTERISTICS
App and Device Usage: Figure 1 (a) shows the ten most popular apps in our dataset, based on the number of users that regularly use them; Sling TV, Netflix, YouTube, and Google games occupy the top four spots. Figure 1 (b) plots the distribution of daily time spent inside the applications. We find that users spend 3-4 hours on average with the OTT device, and only a very small fraction of users spend more than 8 hours. In Figure 1 (c), we break this daily usage into four parts of the day, corresponding to morning (5am-12pm), afternoon (12pm-5pm), evening (5pm-10pm), and night (10pm-5am). Unsurprisingly, users tend to spend more time in the evening than in other periods of the day (median of 2.2 hrs), although this is not significantly higher than the afternoon (median of 1.8 hrs) or the morning (median of 1.4 hrs). Another key statistic of interest is the inter-arrival time between application sessions. Note that there may be no explicit indicator of a user quitting an application: very often, users simply stop using an installed application, or deactivate their accounts while keeping the application installed. Churn must therefore be detected implicitly, i.e., from the fact that the user has not started the application for a sufficiently long time. We calculate the inter-arrival time between successive session start times on a device (across all applications) and plot the maximum values, across all devices, in Figure 1 (d). We observe that users, after an absence, return to their OTT applications within a median time of 6 days (100-200 hours); the 75th-percentile value of this distribution is about 10 days. Later in this section, we leverage this to establish an inactivity threshold when we define churn more precisely.
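As an illustration, the following minimal sketch shows how the per-device maximum inter-arrival time and its third quartile (the inactivity threshold used below) could be derived from session logs. The DataFrame schema (device_id, start_time) and the file name are assumptions for illustration; the paper does not publish its data layout.

```python
import pandas as pd

# Hypothetical session log: one row per application session, with columns
# device_id and start_time (datetime) assumed for illustration.
sessions = pd.read_csv("sessions.csv", parse_dates=["start_time"])

# Inter-arrival time between successive session starts on each device,
# computed across all applications on that device.
sessions = sessions.sort_values(["device_id", "start_time"])
sessions["inter_arrival_hrs"] = (
    sessions.groupby("device_id")["start_time"].diff().dt.total_seconds() / 3600
)

# Maximum inter-arrival gap per device (the quantity plotted in Fig. 1 (d)).
max_gap = sessions.groupby("device_id")["inter_arrival_hrs"].max().dropna()

# The inactivity threshold is the 3rd quartile of this distribution;
# in the paper it works out to roughly 10 days (about 240 hours).
t3q_hours = max_gap.quantile(0.75)
print(f"T3q is approximately {t3q_hours / 24:.1f} days")
```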
User Engagement Patterns: Here we try to understand how users spend time on their OTT devices. We wish to explore the following aspects: (a) are there users who consistently use the box for the same number of hours every day? (b) are there dormant users who do not use their device for a while, but then reactivate it? and (c) are there users who engage with their device only intermittently and for brief periods of time? To answer these questions, we carry out the following analysis (a code sketch follows below). First, for each day that our dataset spans, we compute the cumulative time (in terms of a cumulative distribution function, CDF) that the user engaged with the device. Specifically, we compute (i, c_i) for each user, where i = 1, 2, .., |D| indexes the days in our dataset and c_i is the total number of hours spent on the device up to day i. Next, we carry out a non-linear fit on this data for each user, recording the learned slope, intercept, and standard error as derived features. Finally, we cluster the derived features using K-means, with the number of clusters determined by the silhouette score [23]. This analysis yields four main behavior patterns that cover the vast majority of users, depicted in Figure 2. Each plot is based on the original data of cumulative device engagement time (the x-axis is days elapsed, the y-axis is cumulative time spent). The four patterns can be labeled as follows: (1) mid bloomers (a): users who are initially quiet and do not use the OTT box heavily, but suddenly start using it during the middle phase of their observation period; (2) late bloomers (b): users who remain dormant for a long duration with minimal activity, but suddenly start using the device towards the end; (3) potential churners (c): the users of most interest to us; as the plot shows, these users start using the device heavily at first, but then stop using the box for various reasons (note that since the y-axis is a CDF, a flat line indicates minimal or no activity, also reflected in a very low standard deviation); and (4) consistent users (d): users who regularly use their OTT boxes to watch different shows.
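A minimal sketch of this clustering pipeline under stated assumptions: the degree-2 polynomial fit via np.polyfit is an assumed stand-in, since the paper says only "non-linear fit" without specifying the functional form, and the range of k values is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def engagement_features(cumulative_hours):
    """Fit a curve to one user's cumulative-usage series (hours up to day i)
    and return the fitted coefficients plus the residual standard deviation,
    standing in for the slope/intercept/standard-error features described
    above. The quadratic form is an assumption."""
    y = np.asarray(cumulative_hours, dtype=float)
    days = np.arange(1, len(y) + 1)
    coeffs = np.polyfit(days, y, deg=2)
    resid_std = (y - np.polyval(coeffs, days)).std()
    return np.append(coeffs, resid_std)

def cluster_users(per_user_series, k_range=range(2, 9)):
    """Cluster users by their fitted engagement-curve features, choosing
    the number of clusters by silhouette score."""
    X = np.vstack([engagement_features(s) for s in per_user_series])
    best = max(
        (KMeans(n_clusters=k, n_init=10).fit(X) for k in k_range),
        key=lambda km: silhouette_score(X, km.labels_),
    )
    return best.labels_
```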
Understanding Churning Behavior: Before introducing our prediction models, we briefly explain how we label a user (or device) as having churned, i.e., left the service. This is fundamentally a difficult task because there is no explicit signal for this behavior. Further complicating things, (a) some users do not use the app for a few days but return after a brief period of inactivity, and (b) some users download the app once (or spend a brief amount of time in it) and never use it again. Figure 3 depicts, at a high level, all the information for a user (u) and application (a). The dotted lines at either end mark the period over which data was collected, and each of the green vertical lines in the middle indicates the start of an application session (a_1 is the first session, a_n the last). Here, we see that the application was downloaded after the start of data collection and used several times, with the last instance at t_3. Somewhat infrequently, we see the device itself disappear from the dataset; we consider this a signal that the user has disconnected the device and is no longer using it, and we record this event as having occurred at t_4. With this depiction, we can now define churn in very specific terms by addressing the two challenges discussed above. First, we require that the application not be used for a period of time after the last use; following the example in Figure 3, we impose the condition ∆_2 ≥ T_3q, where T_3q, an inactivity threshold, is the third quartile of the distribution in Figure 1 (d) and turns out to be 10 days. Second, we require a minimum number of sessions to be recorded for an application and user; specifically, u should have engaged with a at least K times, i.e., n ≥ K, and we set K = 3.
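A sketch of the resulting labeling rule. The session-list representation and day units are illustrative; the two conditions (an inactivity gap of at least T_3q = 10 days after the last session, and at least K = 3 recorded sessions) follow the definition above.

```python
T3Q_DAYS = 10       # inactivity threshold: 3rd quartile of max inter-arrival gaps
K_MIN_SESSIONS = 3  # minimum sessions required to label a (user, app) pair

def churn_label(session_days, observation_end_day):
    """Return True (churn), False (non-churn), or None (excluded).

    session_days: sorted day indices on which the app was started;
    observation_end_day: last day covered by data collection.
    """
    if len(session_days) < K_MIN_SESSIONS:
        return None  # fewer than K sessions: not enough evidence to label
    # Condition 1: no usage for at least T3q days after the last session.
    return observation_end_day - session_days[-1] >= T3Q_DAYS
```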
Figure 4 illustrates the characteristics of devices that are exclusively labeled as churned. We see that the median inter-arrival times for the top 5 apps are around 30 days (Figure 4 (b)), significantly higher than the generic inter-arrival characteristics shown in Figure 1 (d). In Figure 4 (c), we notice that users who download more apps tend to have a higher churn rate; we obtained a Pearson correlation coefficient of 0.67. It is also interesting to observe that as churn increases, users tend to switch between apps more frequently, where app switching is indicated by the session feature on the y-axis (Figure 4 (d)).

Figure 1: Generic characteristics of users in D: (a) the top 10 most frequently used apps; (b) the majority of users spend 2-6 hrs per day; (c) users tend to spend more time in the evening; and (d) the majority of users return to their OTT device within a maximum of 8-9 days (≈ 200 hrs).

Figure 2: The four types of users captured by our clustering framework. Clockwise from top left: (a) mid bloomers, (b) late bloomers, (c) potential churners, and (d) consistent users. The error bars show the standard deviation of usage.

Figure 3: High-level summary of user u interacting with application a.

Figure 4: Characteristics of churning OTT devices: (a) the most frequently churned applications in the dataset De; (b) the median inter-arrival time is about 25-30 days for the top-5 churned apps; (c) users tend to churn more when they download more apps; and (d) users who churn on more apps tend to switch between apps more frequently.

4 PREDICTING CHURN
Given a user u and an app a, our objective is to predict whether u will continue or stop using a. We represent each user-app entity as a tuple (X, M_x, Y), where X = {x_1, ..., x_t} is a stream of events (or logs) spanning a time t ∈ T, each event x comprises M features, and Y = {y_1, ..., y_t} are the binary labels that indicate churn (or non-churn) at t. When designing our churn prediction model, we had two main objectives. First, since our data is highly temporal, it is important to learn the latent characteristics of churners (and non-churners) in a way that embeds the temporality of events. Second, we should not only predict churn with good accuracy but also produce highly interpretable results; in other words, we should be able to reason about why a user is churning. To achieve this, we propose the following models: (1) attention LSTM (ALSTM), a simple modification of the neural machine translation (NMT) model [2], and (2) the neural churn prediction model (NCPM), a more comprehensive model that incorporates temporal-level attention (local attention) and feature-level attention (global attention). Both models are based on recurrent neural networks (RNNs), which have been shown to be effective at modeling time-series data [5, 8]. RNNs take a series of temporally dependent inputs and learn their latent representation (or hidden state vector) using the following expression:

h_t = f(h_{t−1}, x_t)    (1)

where h_t is the hidden state at time t and f is some non-linear function. For our application, we model f using a long short-term memory network (LSTM) [13], whose gates and states are defined as follows:

i_t = σ(W_i · [h_{t−1}; x_t] + b_i)
f_t = σ(W_f · [h_{t−1}; x_t] + b_f)
c_t = f_t × c_{t−1} + i_t × tanh(W_c · [h_{t−1}; x_t] + b_c)    (2)
o_t = σ(W_o · [h_{t−1}; x_t] + b_o)
h_t = o_t × tanh(c_t)

where t is the time step (i.e., days), h_t is the hidden state at t, c_t is the cell state at t, x_t is the input (the hidden state of the previous layer) at time t, and i_t, f_t, o_t are the input, forget, and output gates, respectively.

Figure 5: The end-to-end architecture of the proposed models: (a) ALSTM, which applies attention to a single LSTM network to model churn, and (b) NCPM, which uses separate LSTM networks to model individual features and predicts churn using weighted attentions.

Attention LSTM (ALSTM): One can, of course, predict churn by simply feeding the input X to a vanilla LSTM, taking the latent vectors h from the final layer, and using them as features for prediction. A potential issue with this approach is that the compressed (low-dimensional) latent vector h is inefficient at capturing all the information that contributes to churn. As explained in Section 1, in a particular week we might observe a subscriber slowly navigating towards a new application and spending less and less time in one they were previously engaged with (before eventually abandoning it). It is therefore important to give such time windows higher priority than other weeks, and vanilla LSTM networks fail to prioritize these key events. Inspired by recent developments in neural machine translation (NMT) [2], we incorporate attention into the LSTM to overcome this issue. Since our application is very different from natural language processing, we introduce two modifications over NMT: first, we replace the decoder with a single-layer neural network (NN) with sigmoid activation for churn prediction, and second, we change the attention mechanism to suit our problem. The proposed ALSTM model is shown in Figure 5 (a). Here, the attention block A outputs a vector of weights α that captures the importance of the latent vector h at each time frame t. The weighted latent vector p is defined as follows:

p = Σ_{t=1}^{T} α_t h_t    (3)

where the weight α_j for a time instance j is defined by

α_j = exp(s_j) / Σ_{t=1}^{T} exp(s_t),  with  s_j = Σ_i Σ_{k=1}^{K} h_t^k · W_{kt}^j    (4)
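Since our experiments (Section 5) use Keras with TensorFlow, the following is a minimal sketch of the ALSTM idea under stated assumptions: a dense scoring layer stands in for the exact parameterization of s_j in Eq. (4), and the sizes (T = 20, m = 8, 64 hidden units) are illustrative rather than the paper's settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_alstm(T, m, hidden=64):
    """ALSTM sketch: an LSTM encoder with temporal (local) attention and a
    single sigmoid unit in place of the NMT decoder."""
    x = layers.Input(shape=(T, m))                       # T days, m features/day
    h = layers.LSTM(hidden, return_sequences=True)(x)    # hidden states h_1..h_T

    s = layers.Dense(1)(h)                               # per-day scores s_t
    alpha = layers.Softmax(axis=1)(s)                    # Eq. (4): softmax over time
    p = layers.Lambda(                                   # Eq. (3): p = sum_t alpha_t h_t
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([alpha, h])

    y = layers.Dense(1, activation="sigmoid")(p)         # churn probability
    return Model(x, y)

model = build_alstm(T=20, m=8)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```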
Neural churn prediction model (NCPM): One drawback of ALSTM is that it cannot prioritize across features. For example, churn could be influenced more by the consistency of users (see Figure 2), while the number of downloads might matter little. To overcome this, we incorporate both temporal-level and feature-level attention. As depicted in Figure 5 (b), instead of treating the features as a single vector, we decouple the features and model each one with an individual ALSTM. As in ALSTM, the attention block producing p_m for a feature m captures the influence (or weight) of the latent features from different slices of time. The feature-level attention, in turn, is captured by the block B, which is defined by the following expressions:

g = Σ_{m=1}^{M} β_m p_m    (5)

where β_m denotes the individual feature attention weights, defined as

β_m = exp(c^m) / Σ_{m=1}^{M} exp(c^m)    (6)

c_j^m = Σ_{i=1}^{Z} z_i U_{ij}    (7)

where z = p_1 ⊕ {p_i}_{i=2}^{m} is the concatenation (indicated by ⊕) of the feature-level latent vectors. Finally, to predict churn, a linear projection with a sigmoid function is connected to the output of the last layer:

ŷ = σ(W_g · g + b_g)    (8)

The loss for both ALSTM and NCPM is the binary cross-entropy:

L = Σ_i [ −y_i log(ŷ_i) − (1 − y_i) log(1 − ŷ_i) ]    (9)
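A corresponding sketch of NCPM under the same assumptions: one ALSTM-style encoder per feature produces the temporally attended vectors p_m, and a second softmax layer implements the feature-level weights β_m of Eqs. (5)-(7), with a dense scoring layer standing in for the z/U parameterization of Eq. (7). Dimensions are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def feature_encoder(x_m, hidden=32):
    """One ALSTM block: temporal attention over a single feature's LSTM
    states, returning the attended vector p_m (cf. Eq. (3))."""
    h = layers.LSTM(hidden, return_sequences=True)(x_m)
    alpha = layers.Softmax(axis=1)(layers.Dense(1)(h))
    return layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([alpha, h])

def build_ncpm(T, M, hidden=32):
    """NCPM sketch: M per-feature ALSTMs combined by feature-level attention."""
    x = layers.Input(shape=(T, M))
    # Decouple the features: one univariate series per feature m.
    ps = [feature_encoder(layers.Lambda(lambda t, i=i: t[:, :, i:i + 1])(x), hidden)
          for i in range(M)]
    stacked = layers.Lambda(lambda t: tf.stack(t, axis=1))(ps)   # (batch, M, hidden)
    beta = layers.Softmax(axis=1)(layers.Dense(1)(stacked))      # Eq. (6): feature weights
    g = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([beta, stacked])
    y = layers.Dense(1, activation="sigmoid")(g)                 # Eq. (8)
    return Model(x, y)

# Trained with the binary cross-entropy of Eq. (9), e.g.:
# build_ncpm(T=20, M=8).compile(optimizer="adam", loss="binary_crossentropy")
```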
5 EXPERIMENTS
As Table 1 shows, our dataset is biased towards negative samples (i.e., non-churns). Therefore, to create a balanced dataset, for every positive data point (i.e., a churn) for an app a, we randomly sample a corresponding negative data point. We test our models by varying the number of days in the training sample (as explained in Section 2); this lets us see how quickly our models can predict churn. For all experiments, we use 10-fold cross validation, with eight folds used for training, one for validation, and one for testing. The deep learning models are implemented in Keras with TensorFlow as the back-end.

5.1 Baselines
We compare the performance of the proposed ALSTM and NCPM with three baseline methods. Unlike the proposed models, these baselines do not capture the temporality of the data; their inputs are therefore vectors flattened across the time frames.
Logistic Regression: the classic model for binary classification. Albeit simplistic, it helps us understand whether a linear decision boundary is sufficient to separate churners from non-churners. We use the L2 norm as the regularizer and Stochastic Average Gradient (SAG) as the solver.
Multi-layer Perceptron (MLP): a simple two-layer neural network with a dropout layer to avoid overfitting. A linear projection with a sigmoid function is connected to the output of the last layer to produce the churn prediction. Like logistic regression, the MLP does not capture temporal dependencies in the data. The number of neurons is set to 80 for each intermediate layer.
Random Forest (RF): despite the rapid advancements in deep learning, ensemble techniques such as RF [4] remain highly competitive on data with several modalities. In our experiments, the number of decision trees is set to 50 and the maximum depth to 10.
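For concreteness, the baseline configurations described above could be instantiated as follows (a sketch using scikit-learn and Keras; the inputs are the per-user features flattened across the T time frames, and the ReLU activations, max_iter value, and 0.5 dropout rate are assumptions not stated in the paper):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras import layers, models

# Logistic regression: L2 regularization with the SAG solver.
logistic = LogisticRegression(penalty="l2", solver="sag", max_iter=1000)

# Random forest: 50 trees with a maximum depth of 10.
rf = RandomForestClassifier(n_estimators=50, max_depth=10)

def build_mlp(input_dim):
    """Two hidden layers of 80 neurons plus dropout, with a sigmoid output;
    the dropout rate is an assumed value."""
    return models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(80, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(80, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
```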
5.2 Results
Classification accuracy: Tables 2 and 3 show that NCPM outperforms all other models on both De and D, achieving an accuracy of up to 92%. As we increase the number of days, accuracy increases for all models except logistic regression. Here, CA-5 denotes the classification accuracy with just 5 days of data, while CA-20 denotes 20 days of data. We can also see that ALSTM is not as good as NCPM, which supports two conclusions: (1) it is important to learn the latent attributes of each individual feature separately, and (2) incorporating both global and local attention is necessary. That said, ALSTM clearly outperforms the MLP, which underscores the necessity of learning the temporal actions of OTT users. The worst-performing model is logistic regression, which is only slightly better than random selection; this illustrates the difficulty of our churn prediction task. The performance of RF is very close to that of ALSTM, indicating that ensemble models remain a strong candidate for our problem.

In general, model performance on the non-continuous data is much better than on its continuous counterpart, which the following example explains. Say u uses an app on six days before churning, and we have the data {m_1, m_4, m_6, m_7, m_8, m_10} for u, where m is some feature and the suffix indicates the day. Our objective is to predict the outcome on the sixth active day using the first five. Since the user does not use the OTT box on days two, three, and five, the continuous data fed to our models (both NCPM and ALSTM) is essentially the sparse vector {m_1, 0, 0, m_4, 0}, which has several missing values. By contrast, the non-continuous dataset contains the actual usage values for five active days. The model therefore trains on more observed data points, which leads to better prediction accuracy. Another interesting observation is that the performance of all models (except logistic regression) is noticeably better on the all-apps dataset. One key reason is the popularity of the default apps: apps such as Sling TV, Netflix, and YouTube are significantly more popular than the non-default apps, so the models can learn their churn patterns more effectively.

AUC and ROC characteristics: Figures 6 and 7 compare the ROC characteristics of the proposed models with the baselines. The corresponding AUC values are listed in Table 4; due to space constraints, only the non-default case is furnished. As with the accuracy scores, NCPM dominates the other models in most scenarios. We also notice that RF tends to perform better than NCPM and ALSTM when the temporal length of the data is short (i.e., just five days); however, as we incorporate more days of training data, our models clearly improve. The outcome for dataset D is quite different from De: on D we achieve an AUC of almost 89% with just five days of data, and ALSTM performs very similarly to NCPM.

Interpreting the churn prediction: One of the key strengths of our models is interpretability. As explained in Section 1, ALSTM provides a single level of interpretability, indicating which days matter when predicting churn. NCPM has two levels: besides identifying the important days, it also tells us which features are important. We present the interpretability scores as heatmaps in Figure 8; due to lack of space, we only furnish the results for the non-continuous dataset. Heatmaps (a)-(d) show that the influence of features is not uniform across apps. For instance, when few days are available for prediction, churn is influenced mainly by two attributes, namely the number of downloads and the cluster types (Figures 8 (a) and (c)). As we incorporate more data for training (i.e., more days), the attention becomes focused on a few key features: for non-default apps more attention falls on the inter-arrival time, while for all apps the influence shifts towards the number of reboots. It is possible that these apps experience a higher number of app crashes, leading users to reboot the device. Among the non-default apps, Hulu, Pluto TV, and Kodi are heavily influenced by the cluster-id feature that we engineered in Section 3. When it comes to temporal attention (Figures 8 (e)-(h)), for dataset D the influence is concentrated on a few select days, i.e., day 4 in the non-continuous case and day 2 in the continuous case; in contrast, for De the influence is spread across almost all days.

In Section 3, we explained that the engagement pattern of users could have a strong impact on churn. To show this effect, for each user we take the final attention score from the individual RNNs of NCPM and plot the outcome in Figure 9 (a). Here, consistent users are the strongest indicators of non-churn, while potential churners are the strongest indicators of churn. Interestingly, mid bloomers receive higher attention than late bloomers when predicting non-churners, while the opposite holds for churners. Figure 9 (b) takes a deeper look at this outcome by emphasizing the importance of temporal progression for the user types. Unsurprisingly, during the initial phase (elapsed duration of 10-20%), almost all types receive low attention weights, because early on we do not yet have enough data about the user type. As time progresses, around 20-50% of the elapsed duration, potential churners have the strongest impact on the outcome, followed by consistent users and late bloomers. Around 50-80%, the impact of potential churners drops drastically, while that of mid and late bloomers increases. At the final stage (80-100%), almost all user types again have low importance (attention), because in the last phase more data is available in the form of other features, such as the number of downloads and the inter-arrival time between apps; the model can consequently rely on these better indicators in the later stages.
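As a sketch of how such interpretability scores could be read out of a trained model (assuming the ALSTM sketch given in Section 4, where a Softmax layer carries the α weights; the layer index is tied to that sketch and is an assumption):

```python
import numpy as np
from tensorflow.keras import Model

def attention_weights(model, X, softmax_layer_index=-3):
    """Extract the softmax attention activations from a trained model.
    In practice one would name the attention layers and look them up
    rather than rely on a positional index."""
    probe = Model(model.input, model.layers[softmax_layer_index].output)
    return np.squeeze(probe.predict(X), axis=-1)   # (num_users, T) weights

# Averaging over users gives per-day influence rows like Fig. 8 (e)-(h):
# day_influence = attention_weights(alstm, X_test).mean(axis=0)
```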
Table 2: Classification accuracy (CA) for non-default apps with (a) non-continuous data and (b) continuous data, across 5-20 days.

(a) Non-default, non-continuous           (b) Non-default, continuous
Model     CA-5   CA-10  CA-15  CA-20      Model     CA-5   CA-10  CA-15  CA-20
Logistic  0.56   0.55   0.54   0.54       Logistic  0.56   0.55   0.54   0.54
RF        0.65   0.73   0.79   0.80       RF        0.66   0.70   0.73   0.75
MLP       0.60   0.66   0.72   0.78       MLP       0.58   0.63   0.70   0.74
ALSTM     0.65   0.77   0.83   0.86       ALSTM     0.60   0.67   0.74   0.79
NCPM      0.67   0.79   0.84   0.88       NCPM      0.62   0.70   0.76   0.81

Table 3: Classification accuracy (CA) for all apps with (a) non-continuous data and (b) continuous data, across 5-20 days.

(a) All apps, non-continuous              (b) All apps, continuous
Model     CA-5   CA-10  CA-15  CA-20      Model     CA-5   CA-10  CA-15  CA-20
Logistic  0.56   0.55   0.54   0.54       Logistic  0.56   0.55   0.54   0.54
RF        0.73   0.78   0.85   0.89       RF        0.72   0.78   0.78   0.83
MLP       0.70   0.72   0.73   0.74       MLP       0.62   0.70   0.71   0.74
ALSTM     0.78   0.83   0.87   0.91       ALSTM     0.74   0.79   0.83   0.86
NCPM      0.78   0.84   0.89   0.92       NCPM      0.76   0.82   0.84   0.89

Table 4: Area under the ROC curve (AUC) for non-default apps with (a) non-continuous data and (b) continuous data, across 5-20 days.

(a) Non-default, non-continuous           (b) Non-default, continuous
Model     AUC-5  AUC-10 AUC-15 AUC-20     Model     AUC-5  AUC-10 AUC-15 AUC-20
Logistic  0.56   0.55   0.55   0.55       Logistic  0.55   0.53   0.53   0.54
RF        0.71   0.81   0.87   0.89       RF        0.71   0.77   0.83   0.86
MLP       0.65   0.72   0.79   0.84       MLP       0.62   0.69   0.77   0.83
ALSTM     0.66   0.76   0.85   0.89       ALSTM     0.60   0.66   0.71   0.78
NCPM      0.72   0.85   0.90   0.91       NCPM      0.66   0.76   0.84   0.88

Figure 6: Receiver operating characteristic (ROC) curves of the proposed ALSTM and NCPM models for non-default apps at 5, 10, 15, and 20 days. Curves (a)-(d) show the non-continuous dataset and (e)-(h) the continuous dataset.

Figure 7: Receiver operating characteristic (ROC) curves of the proposed ALSTM and NCPM models for all apps at 5, 10, 15, and 20 days. Curves (a)-(d) show the non-continuous dataset and (e)-(h) the continuous dataset.

Figure 8: The feature-level (a-d) and temporal-level (e-h) influences on churn prediction for the non-continuous dataset, at 10 and 20 days for De and D. The color gradient denotes the probability scores, where red indicates the highest weight and green the lowest influence.

Figure 9: The attention weights of mid bloomers (MB), late bloomers (LB), potential churners (PC), and consistent users (CU): (a) the overall attention scores during the final phase of prediction and (b) the attention scores at different stages of temporal progression.
6 RELATED WORK
The problem tackled in this paper relates to the following topics: (1) churn prediction, (2) user behavior modeling, and (3) interpretable neural networks. We now discuss existing research on each topic.
Churn Prediction: User retention (or churn) has been studied extensively in social computing and human-computer interaction (HCI) [9, 14, 27]. However, developing predictive models for churn is still in its infancy. Au et al. [1] adopt a rule-based learning technique for early churn prediction. In [29], the authors tackle churn prediction in mobile apps and find that application performance factors such as energy consumption and latency have a significant impact on retention. [16] use a social-influence-based approach for churn prediction. Recently, [26] developed an interpretable framework that constrains the objective of an RNN with the outcome of K-means clustering to predict the retention of users in Snapchat.
Modeling user behavior: There is a large body of research on behavior modeling [10, 15, 21]. For example, [6] predict user intents by leveraging activity logs in Pinterest. [12] predict the likelihood of a successful search in web search queries, showing that user behavior is more predictive of goal success than document relevance. [22] model the behavior of users in the Kickstarter crowdfunding domain using a heterogeneous combination of social communities, project popularity, and the impact of reward categories. Studies such as [11, 24] and [21] model user behavior from sequential actions such as click streams and social network activities: [24] use a combination of Mahalanobis distance (for detecting outliers) and Markov chains to model sessions in click streams, while [21] use a temporal-LDA-based approach for tour recommendation in Foursquare.
Interpretable Sequence Modeling: RNNs have become the state-of-the-art technique for sequence modeling [5, 13]. Despite a plethora of research in the NLP domain [2, 19], extending interpretable RNNs to other real-world applications is still an emerging field. In recent work, [18] predict the engagement of users in the Snapchat app by capturing in-app action transition patterns as a temporally evolving action graph. [17] develop an interpretable LSTM that learns multi-level graph structures in a progressive and stochastic manner. [20] propose a dual-stage attention model for medical diagnostics such as heart failure prediction. Zhou et al. [28] propose an attention-based RNN that predicts users' purchase probability for targeted ads; although its NN architecture is similar to ours, the problem is quite different from churn prediction, and their modeling of local and global attention also differs from ours. To the best of our knowledge, the only research that closely resembles our work is the churn prediction model proposed by Yang et al. [26]; however, the attention mechanism used in their work is quite different from ours.
7 CONCLUSION
In this paper, we proposed interpretable recurrent-neural-network-based models for predicting churn on over-the-top (OTT) media devices. In the first part of the paper, we analyzed the behavioral characteristics of users and found that they can be categorized into four main types: mid bloomers, late bloomers, potential churners, and consistent users. In the second part, we introduced two models for churn prediction, namely Attention LSTM (ALSTM) and the Neural Churn Prediction Model (NCPM). In ALSTM, churn is predicted by weighting individual time frames (temporal-level attention); NCPM uses two levels of attention, namely feature-level and temporal-level. We showed that NCPM outperforms all other models over a wide range of test cases and achieves an accuracy of up to 89% and an AUC of 92%.

REFERENCES
[1] Wai-Ho Au, Keith C. C. Chan, and Xin Yao. 2003. A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation 7, 6 (2003), 532–545.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157–166.
[4] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[5] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. 2018. Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8, 1 (2018), 6085.
[6] Justin Cheng, Caroline Lo, and Jure Leskovec. 2017. Predicting intent using activity logs: How goal specificity and temporal range affect user behavior. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 593–601.
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[8] Edward Choi, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2016. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association 24, 2 (2016), 361–370.
[9] Giovanni Luca Ciampaglia and Dario Taraborelli. 2015. MoodBar: Increasing new user retention in Wikipedia through lightweight socialization. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 734–742.
[10] Gideon Dror, Dan Pelleg, Oleg Rokhlenko, and Idan Szpektor. 2012. Churn prediction in new users of Yahoo! Answers. In Proceedings of the 21st International Conference on World Wide Web. ACM, 829–834.
[11] Şule Gündüz and M. Tamer Özsu. 2003. A web page prediction model based on click-stream tree representation of user behavior. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 535–540.
[12] Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. 2010. Beyond DCG: User behavior as a predictor of a successful search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 221–230.
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Selim Ickin, Katarzyna Wac, Markus Fiedler, Lucjan Janowski, Jin-Hyuk Hong, and Anind K. Dey. 2012. Factors influencing quality of experience of commonly used mobile applications. IEEE Communications Magazine 50, 4 (2012), 48–56.
[15] Marcel Karnstedt, Matthew Rowe, Jeffrey Chan, Harith Alani, and Conor Hayes. 2011. The effect of user features on churn in social networks. In Proceedings of the 3rd International Web Science Conference. ACM, 23.
[16] Jaya Kawale, Aditya Pal, and Jaideep Srivastava. 2009. Churn prediction in MMORPGs: A social influence based approach. In 2009 International Conference on Computational Science and Engineering, Vol. 4. IEEE, 423–428.
[17] Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, and Eric P. Xing. 2017. Interpretable structure-evolving LSTM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1010–1019.
[18] Yozen Liu, Xiaolin Shi, Lucas Pierce, and Xiang Ren. 2019. Characterizing and forecasting user engagement with in-app action graph: A case study of Snapchat. arXiv preprint arXiv:1906.00355 (2019).
[19] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
[20] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 (2017).
[21] Vineeth Rakesh, Niranjan Jadhav, Alexander Kotov, and Chandan K. Reddy. 2017. Probabilistic social sequential model for tour recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 631–640.
[22] Vineeth Rakesh, Wang-Chien Lee, and Chandan K. Reddy. 2016. Probabilistic group recommendation model for crowdfunding domains. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 257–266.
[23] Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.
[24] Narayanan Sadagopan and Jie Li. 2008. Characterizing typical and atypical user sessions in clickstreams. In Proceedings of the 17th International Conference on World Wide Web. ACM, 885–894.
[25] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[26] Carl Yang, Xiaolin Shi, Luo Jie, and Jiawei Han. 2018. I know you'll be back: Interpretable new user clustering and churn prediction on a mobile social application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 914–922.
[27] Igor Zakhlebin, Em Horvát, et al. 2019. Investor retention in equity crowdfunding. In Proceedings of the 10th ACM Conference on Web Science. ACM, 343–351.
[28] Yichao Zhou, Shaunak Mishra, Jelena Gligorijevic, Tarun Bhatia, and Narayan Bhamidipati. 2019. Understanding consumer journey using attention based recurrent neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 3102–3111.
[29] Agustin Zuniga, Huber Flores, Eemil Lagerspetz, Petteri Nurmi, Sasu Tarkoma, Pan Hui, and Jukka Manner. 2019. Tortoise or hare? Quantifying the effects of performance on mobile app retention. In The World Wide Web Conference. ACM, 2517–2528.