Understanding telecom customer churn with machine learning:
            from prediction to causal inference

Théo Verhelst¹, Olivier Caelen², Jean-Christophe Dewitte²,
Bertrand Lebichot¹, and Gianluca Bontempi¹

¹ Machine Learning Group, Computer Science Department,
  Université Libre de Bruxelles, Brussels, Belgium
  {tverhels,gbonte}@ulb.ac.be
² Data science team, Orange Belgium
  {olivier.caelen,jean-christophe.dewitte}@orange.be

    Telecommunication companies operate in a highly competitive market where
attracting new customers is much more expensive than retaining existing ones [3].
Retention campaigns can be used to prevent customer churn, but their effectiveness
depends on the availability of accurate prediction models. Churn prediction is a
notoriously difficult problem because of the large amount of data, non-linearity,
class imbalance, and low separability between churners and non-churners. In this
paper, we discuss a real case of churn prediction based on Orange Belgium customer
data.
    In the first part of the paper, we focus on the design of an accurate
prediction model. The large imbalance between the two classes is handled with the
EasyEnsemble algorithm [4] using a random forest classifier. The dataset contains
73 variables and about 7.6 million entries, covering a five-month time window in
2018. The classification model is trained on the first four months of data and
evaluated on the last month. We also assess the impact of different data
preprocessing techniques, including feature selection and feature engineering.
Results show that feature selection can be used to reduce computation time and
memory requirements, whereas feature engineering does not necessarily improve
performance.
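The training setup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the names `fit_easy_ensemble` and `predict_churn_proba` are hypothetical, and the ensemble simply averages one scikit-learn random forest per balanced subset (the original EasyEnsemble of [4] trains AdaBoost learners on each subset). The split mimics training on the first four months and evaluating on the last one.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def fit_easy_ensemble(X, y, n_subsets=10, seed=0):
    """Train one random forest per balanced subset: all churners plus an
    equally sized random sample of non-churners (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    models = []
    for i in range(n_subsets):
        sampled_neg = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sampled_neg])
        forest = RandomForestClassifier(n_estimators=50, random_state=seed + i)
        models.append(forest.fit(X[idx], y[idx]))
    return models


def predict_churn_proba(models, X):
    """Average the predicted churn probability over all forests."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic stand-in data (assumption): 3 features, 5 months, rare churners.
rng = np.random.default_rng(42)
n = 5000
X = rng.normal(size=(n, 3))
month = rng.integers(1, 6, size=n)                  # months 1..5
p_churn = 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 4.0)))
y = (rng.random(n) < p_churn).astype(int)

# Time-based split: train on months 1-4, evaluate on month 5.
train, test = month <= 4, month == 5
models = fit_easy_ensemble(X[train], y[train])
proba = predict_churn_proba(models, X[test])
auc = roc_auc_score(y[test], proba)
print(f"churn rate: {y.mean():.3f}, AUC on last month: {auc:.3f}")
```

Splitting by month rather than at random keeps the evaluation closer to how such a model would be used, since it is always applied to customers observed after the training period.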
Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).

    In the second part of the paper, we explore the application of data-driven
causal inference, which allows causal relationships between variables to be
inferred purely from observational data. More specifically, we apply five causal
inference methods: PC [6], Grow-Shrink (GS) [5], Incremental Association Markov
Blanket (IAMB) [7], Minimum Interaction Maximum Relevance (mIMR) [2], and D2C [1].
PC infers the set of causal graphs faithful to the dataset, GS and IAMB infer the
Markov blanket of the churn variable, and mIMR and D2C return the direct causes of
churn. Two implementations of the mIMR algorithm are used: one based on histograms
to estimate mutual information, and another assuming Gaussian variables, which
allows the mutual information to be computed with a closed-form formula. The
results of these algorithms (summarized in Fig. 1) differ across methods but are
consistent with prior knowledge of the causes of churn. We conclude that bill
shock and wrong tariff plan positioning are putative causes of churn.
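The difference between the two mutual information estimators mentioned above can be illustrated for a single pair of variables. The sketch below is an assumption-laden toy, not the mIMR implementation: `gaussian_mi` uses the standard closed form for two jointly Gaussian scalars, I(X;Y) = -0.5 ln(1 - rho^2), and `histogram_mi` is a plain plug-in estimate on an equal-width grid. Both function names and the synthetic data are hypothetical.

```python
import math
import random
from collections import Counter


def gaussian_mi(x, y):
    """Mutual information (nats) of two jointly Gaussian variables,
    computed in closed form from the sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    rho = sxy / math.sqrt(sxx * syy)
    return -0.5 * math.log(1.0 - rho ** 2)


def histogram_mi(x, y, bins=10):
    """Plug-in estimate of I(X;Y) (nats) from an equal-width 2-D histogram."""
    def bin_index(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0   # guard against constant input
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = bin_index(x), bin_index(y)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


# Synthetic correlated Gaussian pair (assumption), correlation 0.8.
rng = random.Random(0)
rho = 0.8
x = [rng.gauss(0, 1) for _ in range(20000)]
y = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x]
z = [rng.gauss(0, 1) for _ in range(20000)]       # independent of x

true_mi = -0.5 * math.log(1 - rho ** 2)
mi_gauss = gaussian_mi(x, y)
mi_hist = histogram_mi(x, y)
mi_indep = histogram_mi(x, z)
print(f"true {true_mi:.3f}  gaussian {mi_gauss:.3f}  histogram {mi_hist:.3f}")
```

The Gaussian estimator is cheap and accurate when the assumption holds, but it only captures linear dependence. The histogram estimator makes no distributional assumption, at the cost of a dependence on the number of bins.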
[Figure 1: graph relating churn to the variables tariff plan, province, number of
contracts, out of bundle, age, messages and voice calls, data usage, and tenure,
annotated with method labels (GS, IAMB, mIMR 1, mIMR 2, D2C).]

Fig. 1. Summary of results of causal inference. mIMR 1 stands for the
histogram-based estimator, and mIMR 2 for the estimator with Gaussian assumption.

    Finally, we present a novel method to evaluate, in terms of direction and
magnitude, the impact of causally relevant variables on churn. We evaluate the
average probability of churn predicted by the learning algorithm on the dataset,
before and after a shift of the values of the variable of interest. The difference
between these two average probabilities is a measure of the effect of a
manipulation of the variable on churn. This method is based on the assumption that
no latent variable confounds churn and the variable under inspection. Results show
that, on the one hand, some variables, such as the tenure and the number of
contracts, are monotonically associated with the churn probability. On the other
hand, some variables have a non-monotonic causal influence on churn. For example,
variables related to the amount paid by the customer and the data usage cause more
churn when they are increased, but decreasing them does not have the opposite
effect.
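The shift-based evaluation described above can be sketched in a few lines. Everything here is illustrative: `toy_model` is a hypothetical stand-in for the trained classifier, the customer records and variable names are invented, and reading the output causally relies on the no-latent-confounder assumption stated in the text.

```python
import math


def average_churn_probability(predict_proba, rows):
    """Mean predicted churn probability over a list of customer records."""
    return sum(predict_proba(r) for r in rows) / len(rows)


def shift_effect(predict_proba, rows, feature, delta):
    """Change in the average predicted churn probability when `feature`
    is shifted by `delta` for every customer, all else unchanged."""
    shifted = [{**r, feature: r[feature] + delta} for r in rows]
    return (average_churn_probability(predict_proba, shifted)
            - average_churn_probability(predict_proba, rows))


def toy_model(row):
    """Hypothetical classifier: churn falls with tenure and rises only
    once the out-of-bundle amount exceeds a threshold."""
    score = (-2.0
             - 0.05 * row["tenure"]
             + 0.03 * max(row["out_of_bundle"] - 10.0, 0.0))
    return 1.0 / (1.0 + math.exp(-score))


# Invented customer records (assumption).
customers = [
    {"tenure": 3.0, "out_of_bundle": 0.0},
    {"tenure": 24.0, "out_of_bundle": 5.0},
    {"tenure": 60.0, "out_of_bundle": 20.0},
    {"tenure": 12.0, "out_of_bundle": 40.0},
]

for feature, delta in [("tenure", 12.0), ("out_of_bundle", 10.0),
                       ("out_of_bundle", -10.0)]:
    effect = shift_effect(toy_model, customers, feature, delta)
    print(f"{feature:>14} shifted by {delta:+.0f}: {effect:+.4f}")
```

Evaluating the effect over a range of positive and negative shifts, rather than a single value, is what reveals whether the association is monotonic.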


References
1. Bontempi, G., Flauder, M.: From dependency to causality: a machine learning
   approach. The Journal of Machine Learning Research 16(1), 2437–2457 (2015)
2. Bontempi, G., Meyer, P.E.: Causal filter selection in microarray data. In:
   Proceedings of the 27th International Conference on Machine Learning (ICML-10).
   pp. 95–102 (2010)
3. Hadden, J., Tiwari, A., Roy, R., Ruta, D.: Computer assisted customer churn
   management: State-of-the-art and future trends. Computers & Operations Research
   34(10), 2902–2917 (2007)
4. Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance
   learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B
   (Cybernetics) 39(2), 539–550 (2009). https://doi.org/10.1109/tsmcb.2008.2007853
5. Margaritis, D., Thrun, S.: Bayesian network induction via local neighborhoods.
   In: Advances in Neural Information Processing Systems. pp. 505–511 (2000)
6. Spirtes, P., Glymour, C.: An algorithm for fast recovery of sparse causal graphs.
   Social Science Computer Review 9(1), 62–72 (1991)
7. Tsamardinos, I., Aliferis, C.F., Statnikov, A.R., Statnikov, E.: Algorithms for
   large scale Markov blanket discovery. In: FLAIRS Conference. vol. 2, pp. 376–380
   (2003)