Telecommunication customer churn prediction using machine learning methods

Monika Zdanavičiūtė 1,2, Rūta Juozaitienė 1,2,3 and Tomas Krilavičius 1,2

1 Vytautas Magnus University, Faculty of Informatics, Vileikos street 8, LT-44404 Kaunas, Lithuania
2 Centre for Applied Research and Development, Lithuania
3 Vilnius University, Vilnius, Lithuania

IVUS 2022: 27th International Conference on Information Technology, May 12, 2022, Kaunas, Lithuania
monika.zdanaviciute@vdu.lt (M. Zdanavičiūtė); ruta.juozaitiene@vdu.lt (R. Juozaitienė); tomas.krilavicius@vdu.lt (T. Krilavičius)

Abstract
In recent years the telecommunication sector has grown significantly due to the use of smart technologies, and it is likely to continue to grow. The main resource of telecommunications companies is their customers, but due to the relatively high level of competition in this field, most customers are not tied to a single service company. To understand the key factors contributing to the customer churn rate, we have analysed the real data of one telecommunication company. The data from 2020-01-01 to 2022-03-07 consisted of information on 21128 users, 140970 payments and 350379 calls. The main contribution of our work was to develop a churn prediction model which identifies the customers who are most likely to churn. We performed experiments using k-nearest neighbours, support vector machine, decision tree, random forest and naive Bayes classifiers, and the Cox proportional hazards model with time-varying covariates. The results showed that the Cox regression model with time-varying covariates was superior to the classical classification methods because it can take into account static user parameters and reflect their changes over time.

Keywords
Churn prediction, telecommunication churn, survival analysis, churn, telecommunications

1. Introduction

In recent years the telecommunication sector has grown significantly due to the use of smart technologies, and it is likely to continue to grow. The main resource of telecommunications companies is their customers, but due to the relatively high level of competition in this field, most customers are not tied to a single service company. Therefore, in order to create a successful business in this field, it is necessary to know your client, his needs and opportunities. The data collected by telecommunication companies on a daily basis can be very helpful in gaining a proper understanding of customer behavior. Analysis of such data is needed to understand what factors may be associated with a customer leaving the customer base. The main goal of this research is to develop a churn prediction model which identifies the customers who are most likely to churn.

2. Literature review

The term customer churn describes the loss of customers by a company [1]. Due to the specifics of telecommunications companies, this is a common business problem in this area. To retain customers, companies try to predict which consumers are going to leave in a variety of ways.

A genetic algorithm has been proposed to identify customers who intend to change their telecommunications company in the near future [2]. The database used in the study consisted of call data of 5250 customers. Each user's profile consisted of information about his behavior and habits (average monthly costs of local and international calls, average amount of internet data, average monthly call time, amount of roaming and special services used). In the developed model, the genetic algorithm, by iteratively adjusting the coordinates of each profile in the plane, creates groups of similar elements. The efficiency of the model was evaluated by observing the change in the error function at each iteration. The algorithm grouped customers into four clusters: 1% of very high-spending customers, 9% of high and medium-spending customers, 12% of medium-spending customers, and 78% of low-spending customers.

The study [3] was conducted by analyzing data from 7043 telecommunication customers, 1869 of whom had already left the customer base. Each customer is described by 21 variables, one of which is binary and indicates whether the user has already left the company. Using XGBoost (Extreme Gradient Boosting Tree), k-Nearest Neighbors and Random Forest methods, customers are classified into two groups according to this variable. Accuracy and F-score measures were used to assess the accuracy of the models, and they showed that the XGBoost method was the best classifier. In addition, this method was used to find out which variables most influence customer exit. This study found that customers with higher monthly charges are more likely to churn.

Data from the SyriaTel telecommunications service provider were used for the study [4]. The analyzed period is 9 months (about 10 million users), and the available information includes data about the client (age, gender, place of residence, type of contract concluded, services received), his actions (calls, messages, and internet usage), mobile device (device type, brand, model), and telecommunications tower infrastructure. To better describe users, the available data was used to create a social network of all customers and to calculate variables such as degree centrality measures, similarity values, and the customer's network connectivity. For model training and testing, the data were separated into training (70%) and testing (30%) sets. Because the data sets were unbalanced (there were significantly more churned customers than active ones), the classification was done in two ways: by balancing the sets and by applying the data as it is. Using Decision Trees, Random Forest, GBM (Gradient Boosted Machine Tree), and XGBoost algorithms, customers were classified into two classes: churned and existing customers. The AUC (Area Under Curve) was used to determine accuracy. The obtained results showed that the XGBoost algorithm classifies customers best according to the available data.

The data set for the study [5] consists of call records obtained from the University of California, Department of Information and Computer Science. The data set provides information on the mobile system use of 3333 customers and consists of 15 quantitative variables, 5 categorical variables and a binary variable describing whether the customer has left the customer base of the telecommunications service provider. In the analysis of the available call data, each user is assigned variables describing his call habits and, using classification methods, the customers are divided into two classes according to this binary variable. The research uses Neural Networks, Support Vector Machine and Bayesian classification methods. The data set is divided into training (80% of all data) and testing (20% of all data) sets so that the training set is balanced. Then 95% of the customers who leave and 5% of the existing ones remain in the testing set. The study revealed that the support vector method best separates customers into two groups.

The study [6] uses telecommunication user data for customer analysis, which stores basic user information (age, gender, etc.) and plan order information (payment method, monthly fee, full-time fee, etc.). It also provides information about the services (telephones, internet, television, insurance, etc.) and information on whether the customer is active or has already churned. Clustering (k-Means, DBSCAN) and classification methods (Multi-Layer Perceptron, Back Propagation algorithm, Decision Trees, Logistic Regression, Support Vector Machine) were used to analyze this data. The classification models were evaluated with several measures of precision and accuracy, and the Back Propagation algorithm and the Multi-Layer Perceptron best predicted customer churn. When clustering analysis was applied, active and inactive clients were best separated using the DBSCAN algorithm.

The customer loyalty task is usually formulated as a classification task whose data set consists of active and churned users. To solve this problem, the literature suggests the use of k-Nearest Neighbors [3], Neural Networks [5][6], Support Vector Machines [6][5], and Bayesian classifiers [5]. Decision Tree, Random Forest, and XGBoost algorithms can also be used to analyze customer loyalty [3][4]. However, there are also cases when this problem is solved by applying clustering methods, e.g. a genetic [2] or k-means algorithm [6]. This type of task is usually based on user information, as well as payment history and call data. Research shows that customers who pay more for services tend to change telecommunications operators.
3. Methods

1. k-Nearest Neighbors is an algorithm that stores all available cases and classifies new cases based on a similarity measure (distance functions). Euclidean distance function [7]:

$\sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$   (1)

2. Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes [7]. Hyperplane equation:

$w^T x + b = 0$   (2)

To define an optimal hyperplane we need to maximize the width of the margin:

$\max \frac{2}{\|w\|}$   (3)

3. Decision Tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label [8]. A quantitative measure of randomness, entropy, is used to select a feature in a node. The initial entropy of the set $E$:

$H(E) = -\sum_{k_i \in K} P(k_i \mid E) \log_2 P(k_i \mid E)$,   (4)

where

$P(k_i \mid E) = \frac{|\{e : e \in E, e \in k_i\}|}{|E|}$.   (5)

The mean entropy after dividing the set into subsets $E_1, \ldots, E_n$:

$B(E, p) = \sum_{j=1}^{n} P(v_j \mid E) H(E_j)$,   (6)

where

$P(v_j \mid E) = \frac{|E_j|}{|E|}$.   (7)

4. Random Forest is an ensemble learning method for classification tasks that operates by constructing a multitude of decision trees at training time. The output of the random forest is the class selected by most trees [9].

5. Naïve Bayes classifier assumes that the effect of the value of a predictor ($x$) on a given class ($c$) is independent of the values of the other predictors [7]. This assumption is called class conditional independence.

$P(c \mid x) = \frac{P(x \mid c) P(c)}{P(x)}$.   (8)

6. Cox proportional hazards model with time-varying covariates is a method for investigating the effect of several variables upon the time a specified event takes to happen. In a Cox proportional hazards regression model, the measure of effect is the hazard rate. Hazard function for individual $i$:

$h_i(t) = h_0(t) \exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_n x_{in})$,   (9)

where $h_0(t)$ is the baseline hazard function, $x_{i1}, x_{i2}, \ldots, x_{in}$ are covariates and $\beta_1, \beta_2, \ldots, \beta_n$ are regression coefficients.

7. Confusion matrix was used to assess the accuracy of user classification. Elements of the confusion matrix:

• TP (true positive) - the user is expected not to churn and he remains.
• TN (true negative) - the user is expected to churn and he churns.
• FP (false positive) - the user is expected to remain but he churns.
• FN (false negative) - the user is expected to churn but he remains.
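The classifiers listed above are standard and available in common machine-learning libraries. The paper does not state which implementation was used; purely as an illustration, the sketch below assembles the five classifiers and the confusion matrix with scikit-learn. The feature matrix X and the labels y are random placeholders standing in for the user-level variables of Table 1, not the authors' data pipeline.

```python
# Hedged sketch: one possible scikit-learn setup for the five classifiers
# described above. X and y are synthetic placeholders. Note that here the
# positive class is "churned" (1); the paper defines TP as a correctly
# predicted remaining user, so labels may need to be flipped to match Table 2.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))      # 14 behaviour variables (as in Table 1)
y = rng.integers(0, 2, size=1000)    # churn label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

models = {
    "k-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    print(f"{name}: TP={tp} TN={tn} FP={fp} FN={fn} "
          f"accuracy={accuracy_score(y_test, pred):.2%}")
```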
4. Data set

The analyzed data consists of three data sets covering the range from 2020-01-01 to 2022-03-07:

1. Users data set. Individual user information, which includes demographic and other data provided during registration. This study analyzes 21128 users.
2. Payments data set. 140970 payment records showing when and what type of plan was purchased and how much it cost. There are two types of plans: monthly and yearly.
3. CDR (call detail record) data set. Real-time data records documenting telephone calls or other telecommunications operations (3350379 records).

After the data transformations, a list of variables describing the users was created (Table 1).

Table 1
Created user-defining variables used for churn prediction

Total amount of seconds called
Total amount of calls
Number of failed calls
Ratio of failed calls to total calls
The amount of not failed calls
Total amount of active days
Mean call duration
Max call duration
Median between calls
Median between active days
Number of contacts called
Total amount of purchased plans
Last plan before (amount of days)
Total amount paid
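The paper does not detail the transformation code behind Table 1. Purely as an illustration, the sketch below derives a few of these variables from a CDR extract with hypothetical columns user_id, call_start, duration_s and failed; the real schema and transformations are not described in the paper.

```python
# Hedged sketch: deriving a few Table 1-style variables from a toy CDR table.
# Column names and the toy data are assumptions, not the authors' pipeline.
import pandas as pd

cdr = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "call_start": pd.to_datetime(
        ["2020-01-02", "2020-01-05", "2020-01-05", "2020-02-01", "2020-02-20"]),
    "duration_s": [60, 0, 300, 45, 120],
    "failed":     [0, 1, 0, 0, 0],
})

def user_features(g: pd.DataFrame) -> pd.Series:
    # distinct calendar days with at least one call, in chronological order
    days = g["call_start"].dt.normalize().drop_duplicates().sort_values()
    return pd.Series({
        "total_seconds_called": g["duration_s"].sum(),
        "total_calls": len(g),
        "failed_calls": g["failed"].sum(),
        "failed_call_ratio": g["failed"].mean(),
        "active_days": days.nunique(),
        "mean_call_duration": g["duration_s"].mean(),
        "max_call_duration": g["duration_s"].max(),
        "median_days_between_active_days": days.diff().dt.days.median(),
    })

features = cdr.groupby("user_id").apply(user_features)
print(features)
```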
5. Churn definition

In order to assess the risk of customer churn, the definition of churn must first be defined. Since a user leaving the customer base can be described in several ways, it is necessary to monitor client behavior and changes in activity and to decide which definition best describes churn. In this study, user churn is described in two different ways, and different problem-solving methods are used for each of these two options.

1. The user is classified as a churned customer if he has not purchased a new plan 35 days after the first plan purchase.

Figure 1 shows a bar graph with the distribution of the number of plans purchased by customers. It shows that most customers have bought only one plan.

Figure 1: Frequency of number of plans purchased by the user

The distribution of intervals between plan orders for users who have purchased more than one plan is shown in Figure 2. It shows that most plans are ordered every 30 days, in other words, most plans are ordered on a regular monthly basis. There are also some users who order multiple plans on the same day.

Figure 2: Frequency of number of days between plans

The data set for the classification models consists of variables describing user behavior (Table 1), calculated on the 25th day after the purchase of the first plan. Class labels indicate whether the customer has purchased a second plan within 35 days after the first plan. Five different methods are used for classification: k-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest and Naïve Bayes classifier. This definition of churn can only be used to predict consumers purchasing monthly plans, so it was decided to define churn in another, more universal way.

2. The user is classified as a churned customer if he does not use the services provided by the company for 25 consecutive days (does not call anyone).

To find the optimal interval of days after which we could treat the user as leaving, rather than just taking a break between calls, the percentage of users who returned to the system after x days of inactivity is calculated. In the graph shown in Figure 3, the abscissa axis reflects the number of inactive days x, and the ordinate axis corresponds to the number of users (in percent). The blue bar shows the percentage of users who had an x-day interval between calls, and the red bar represents the percentage of users who returned to the system after x days (called again).

Figure 3: Frequency of users returning to the system after x days of inactivity

It can be seen from this graph that almost all users have had a one-day interval (x = 1) between calls and only about 60% of them have returned to the system after this interval. Nearly 80% of users have had a thirty-day interval (x = 30) between calls, with less than 25% returning to the system. There is no clear break in the number of users who have not returned to the system, but there is a steady decrease in the number of users who have returned to the system. It has been decided that 25 days is a sufficient period of inactivity to consider a user as leaving the system.

The user is monitored from the first day of registration until churn (25 inactive days in a row). In this case, it is not the static variables that are observed, but their change over time. There are times when, after a long break (after the so-called churn), the user returns to the system and starts using the services again. For such cases, the algorithm is designed so that the withdrawn customer is still monitored, and when he returns to the system (calls again), he is treated as a newly logged-in user.
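The inactivity analysis behind Figure 3 is described only in words. Purely as an illustration, the sketch below computes, for a few gap lengths x, the share of users with an x-day gap between calls and the share who called again afterwards. The column names, the use of exact gap lengths and the handling of the trailing gap (last call to the end of the data window) are all assumptions, not the authors' definition.

```python
# Hedged sketch of a Figure 3-style computation: blue bar = users who had an
# x-day gap between calls, red bar = users who called again after such a gap.
import pandas as pd

calls = pd.DataFrame({                           # toy CDR stand-in
    "user_id": [1, 1, 1, 2, 2, 3],
    "call_date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-02-15",
         "2020-01-10", "2020-01-11", "2020-03-01"]),
})
end_of_data = pd.Timestamp("2022-03-07")

records = []
for uid, g in calls.sort_values("call_date").groupby("user_id"):
    dates = g["call_date"].drop_duplicates().reset_index(drop=True)
    # a gap between two observed calls is, by construction, followed by a return
    for gap in dates.diff().dt.days.dropna():
        records.append({"user_id": uid, "gap": int(gap), "returned": True})
    # the gap from the last call to the end of the data window is not
    records.append({"user_id": uid,
                    "gap": int((end_of_data - dates.iloc[-1]).days),
                    "returned": False})

gaps = pd.DataFrame(records)
n_users = calls["user_id"].nunique()

def bar_values(x: int) -> tuple:
    """Return (blue, red) percentages for inactivity length x."""
    sel = gaps["gap"] == x
    blue = gaps.loc[sel, "user_id"].nunique() / n_users * 100
    red = gaps.loc[sel & gaps["returned"], "user_id"].nunique() / n_users * 100
    return blue, red

for x in (1, 25, 30):
    blue, red = bar_values(x)
    print(f"x={x:>2}: had such a gap {blue:5.1f}%, returned afterwards {red:5.1f}%")
```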
6. Experiments

6.1. Evaluating the purchase of a plan

User classification is performed by dividing customers into these two groups:

• An active user is one who re-orders a plan within 35 days after the first order of the plan.
• A withdrawn user did not order another plan within 35 days after ordering the first plan.

The selected forecast period is 10 days. User data is tracked for 25 days from the first plan purchase. Based on these data, the characteristics describing the user's behavior are calculated on the 25th day after the purchase of the first plan. An attempt is then made to assign the user to one of the classes (predicted to remain in the system or to leave). The characteristics describing user activity are presented in Table 1.

Customers are divided into model training (70% of the data) and testing (30% of the data) sets. Five different methods are used for classification: k-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest and Naïve Bayes classifier. The values of the confusion matrix elements evaluating the accuracy of the listed classification methods and the percentage accuracy of all models are given in Table 2. It can be seen that the SVM model achieves the best accuracy.

Table 2
Evaluation of classification models accuracy

Method                     TP     TN     FP     FN     Accuracy
k-Nearest Neighbors        634    3747   650    1308   69.11 %
Support Vector Machine     631    4094   303    1311   74.54 %
Decision Tree              688    4028   369    1254   74.40 %
Random Forest              780    3937   460    1162   74.41 %
Naïve Bayes classifier     629    3931   466    1313   71.94 %
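The Accuracy column in Table 2 follows directly from the confusion matrix counts, accuracy = (TP + TN) / (TP + TN + FP + FN); the short check below reproduces the reported percentages from the table's own numbers.

```python
# Recomputing the Accuracy column of Table 2 from its TP/TN/FP/FN counts.
rows = {
    "k-Nearest Neighbors":    (634, 3747, 650, 1308),
    "Support Vector Machine": (631, 4094, 303, 1311),
    "Decision Tree":          (688, 4028, 369, 1254),
    "Random Forest":          (780, 3937, 460, 1162),
    "Naive Bayes classifier": (629, 3931, 466, 1313),
}
for name, (tp, tn, fp, fn) in rows.items():
    acc = (tp + tn) / (tp + tn + fp + fn)
    print(f"{name:<24} accuracy = {acc:.2%}")   # matches Table 2 (69.11 %, ...)
```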
6.2. Estimating inactive time intervals between calls

5000 users were randomly selected from the full list of users. For each of them, the variables in Table 1 are calculated on every day from the time the user registers until he leaves or until the end of the entire data range. This results in a data set in which each user is described not by one row, but by as many rows as the number of days the user has been in the system. Having this data set, it is possible to track how user activity has changed over time. Any user who leaves the system for more than 25 days (does not call anyone for 25 days) and then returns to it (calls again) is treated as a new user (assigned a new identification number). As a result, the creation of such a user data set increases the number of users to a total of 15435. A training set (70% of these users) is used to create the model.

To select the most appropriate Cox regression model, three different combinations of variables were created and three models were constructed. Table 3 shows the variables for all three models. In each case, only non-correlated, statistically significant variables are included in the model.

Table 3
Variables describing user behavior in Cox models

Included in all three models (M1, M2, M3): active days count, mean call duration, number of contacts, null call ratio, median days between calls, median days between active days, total paid, last plan before.
Included in two of the three models: max call duration, max days between calls.

The accuracy of these three models was assessed using a test set (30% of users). Four dates in the analyzed period were selected for model testing. Active users are selected on a specific date and it is predicted whether after 10 days they will still be active or will have churned. The same is repeated with the four different dates. In this way, the real performance of the model is verified, since both short-term and long-term customers are evaluated. Some users may have been analyzed several times at different dates. In total, the models evaluated customers 2741 times; in 1976 cases the user did not quit and in 765 cases the user left the system. The accuracy of the models is presented in Table 4. It can be seen that the M1 and M3 models have achieved equal accuracy and are more suitable for predicting churn than the M2 model.

Table 4
Evaluation of Cox models accuracy

Model   TP      TN     FP     FN     Accuracy
M1      1059    529    236    917    57.94 %
M2      991     539    226    985    55.82 %
M3      1059    529    236    917    57.94 %
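Fitting a Cox model on such a long-format data set requires a counting-process layout with start/stop intervals and time-varying covariates. The paper does not name the software used; as one possible illustration, the sketch below fits lifelines' CoxTimeVaryingFitter to a synthetic data set with that structure. The variable names, the synthetic data generator and the interval length are assumptions.

```python
# Hedged sketch: Cox proportional hazards with time-varying covariates on a
# synthetic long-format (start/stop) data set, using the lifelines package.
# This is illustrative only, not the authors' model or data.
import numpy as np
import pandas as pd
from lifelines import CoxTimeVaryingFitter

rng = np.random.default_rng(42)
rows = []
for uid in range(300):
    frailty = rng.normal()                              # latent churn proneness
    p_churn = 1.0 / (1.0 + np.exp(-(frailty - 2.0)))    # per-interval churn prob.
    for m in range(12):                                 # up to twelve 30-day intervals
        activity = 20.0 - 3.0 * frailty + rng.normal()  # time-varying covariate
        event = rng.random() < p_churn
        rows.append({"id": uid, "start": 30 * m, "stop": 30 * (m + 1),
                     "active_days_count": activity, "event": int(event)})
        if event:
            break                                       # user leaves the risk set

long_df = pd.DataFrame(rows)

ctv = CoxTimeVaryingFitter()
ctv.fit(long_df, id_col="id", event_col="event",
        start_col="start", stop_col="stop")
ctv.print_summary()   # the coefficient for active_days_count should be negative

# Rank users by their estimated relative churn risk on their latest interval
latest = long_df.groupby("id").tail(1)
print(ctv.predict_partial_hazard(latest[["active_days_count"]]).head())
```

The counting-process layout is what lets the model use the daily-updated behavior variables: each interval carries the covariate values that were current during that interval, so changes over time enter the likelihood directly instead of being frozen at registration.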
7. Conclusion

Experiments with the telecommunication customer data set show that:

1. After assessing the specifics of the available data, it was decided to define user activity in two ways: according to the plans purchased and according to the frequency of calls made.
2. In the case where the customer is considered active as long as he regularly purchases the call plans offered by the supplier, the following classification algorithms were used to segment the users: k-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, and Naïve Bayes classifier. Based on the experimental studies, it can be stated that the Support Vector Machine classifier outperforms the other methods.
3. In the case where user activity is defined by a faster-changing indicator, the frequency of calls, it was decided to use the Cox regression model with time-varying covariates to divide users into groups. This model is superior to the classical classification methods in that it can take into account not only static user parameters but also their change over time.

In further research, the possibility of combining these two methods to predict the likelihood of customer churn may be considered.

References

[1] B. Huang, M. T. Kechadi, B. Buckley, Customer churn prediction in telecommunications, Expert Systems with Applications 39 (2012) 1414–1425.
[2] H. Ren, Y. Zheng, Y.-r. Wu, Clustering analysis of telecommunication customers, The Journal of China Universities of Posts and Telecommunications 16 (2009) 114–128. URL: http://www.sciencedirect.com/science/article/pii/S1005888508602149. doi:10.1016/S1005-8885(08)60214-9.
[3] J. Pamina, B. Raja, S. SathyaBama, M. Sruthi, A. VJ, et al., An effective classifier for predicting churn in telecommunication, Journal of Advanced Research in Dynamical & Control Systems 11 (2019).
[4] A. K. Ahmad, A. Jafar, K. Aljoumaa, Customer churn prediction in telecom using machine learning in big data platform, Journal of Big Data 6 (2019) 28.
[5] I. Brânduşoiu, G. Toderean, H. Beleiu, Methods for churn prediction in the pre-paid mobile telecommunications industry, in: 2016 International Conference on Communications (COMM), 2016, pp. 97–100. doi:10.1109/ICComm.2016.7528311.
[6] I. M. Mitkees, S. M. Badr, A. I. B. ElSeddawy, Customer churn prediction model using data mining techniques, in: 2017 13th International Computer Engineering Conference (ICENCO), IEEE, 2017, pp. 262–268.
[7] J. Han, J. Pei, M. Kamber, Data mining: concepts and techniques, Elsevier, 2011.
[8] G. Norkevičius, G. Raškinis, Lietuvių kalbos garsų trukmės modeliavimas klasifikavimo ir regresijos medžiais, naudojant didelės apimties garsyną [Modelling the durations of Lithuanian speech sounds with classification and regression trees using a large speech corpus], in: Informacinės technologijos 2007: konferencijos pranešimų medžiaga, Kauno technologijos universitetas, Kaunas: Technologija, 2007.
[9] G. Biau, E. Scornet, A random forest guided tour, Test 25 (2016) 197–227.