Data Mining Methods for Evaluation and Forecasting the
Mobile Internet Traffic in Roaming
Nataliia V. Kuznietsova a, Petro I. Bidyuk a and Anastasiia V. Kulinich a
a National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, 03056, Ukraine


                 Abstract
                 This paper addresses a practical task of the telecommunication industry: forecasting
                 the use of services in roaming. It studies data mining methods that make it possible to
                 predict the volume of services (calls, traffic) used by a particular subscriber abroad on
                 the basis of available statistical information. The problem is split into two tasks:
                 forecasting the volume of internet traffic in roaming, and classifying clients according
                 to their behavior in roaming. The evaluation and forecasting task was solved with models
                 based on time series theory, namely autoregressive models. The best model was selected
                 by statistical criteria and used to forecast the traffic volume for the following months.
                 The classification task was solved with data mining methods such as neural networks,
                 gradient boosting, random forest and logistic regression. The gradient boosting model
                 was selected because it showed the highest completeness (recall) and accuracy on the
                 input data. Based on the modelling results, recommendations and special strategies for
                 the telecommunication company were developed.

                 Keywords
                 Data Mining, Gradient Boosting, Random Forest, Neural Networks, Logistic Regression,
                 Bootstrap analysis, Time Series, Mobile Internet Traffic, Forecasting, Roaming

1. Introduction
   Today mobile operators are interested in restoring their services, including mobile roaming, to the
level observed before the Covid-19 epidemic. To predict the volume of mobile Internet use, including
abroad [1], it is necessary to determine optimal packages and the behavior of subscribers [2, 3], i.e.
how they use particular services within their own country. It then becomes possible to predict how
such a subscriber will behave abroad using modern methods of data mining and forecasting [4 – 12].
   Special mention should be made of virtual mobile operators (for example, LycaMobile), which use
roaming technology and rent the telecommunication towers and equipment of other mobile operators to
provide services. Practically every subscriber of a virtual operator is in roaming within their own
country, and therefore information on the use of services in Ukraine can be used to predict their
behavior abroad.
   Roaming is the situation in which a subscriber of a mobile operator (the home network) uses the
network of another mobile operator outside the geographical coverage area of the home network.
A peculiarity of doing business in international roaming is the need to conclude a huge number of
agreements with major international telecommunication operators providing services in more than a
hundred countries [1]. The national operator should take into account that certain services of some
telecommunication companies abroad are extremely expensive for both the operator and the
subscriber. Thus, roaming tariffs in such countries should be formed and allocated

CMIS-2021: The Fourth International Workshop on Computer Modeling and Intelligent Systems, April 27, 2021, Zaporizhzia, Ukraine
EMAIL: natalia-kpi@ukr.net (N. Kuznietsova); pbidyuke_00@ukr.net (P. Bidyuk); a.kulinich2509@gmail.com (A. Kulinich)
ORCID: 0000-0002-1662-1974 (N. Kuznietsova); 0000-0002-7421-3565 (P. Bidyuk); 0000-0001-7156-109X (A. Kulinich)
            © 2020 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)
to a separate group, because one minute can cost the mobile operator much more than the company
receives from the subscriber.
    The national mobile operator must also provide appropriate conditions for the use of mobile
services by its subscribers around the world. The services that subscribers use abroad can be divided
into several types, including voice communication, messages and mobile Internet.
    Ukrainian subscribers abroad usually prefer voice calls and mobile data. It was expected that,
with the introduction of the 4G standard, traditional telecommunication services such as voice calls
and short messages (SMS) would become a thing of the past. There is indeed a decrease in the number
of subscribers who use standard short messages, but for voice calls the picture is less clear.
People's need for live communication, which is impossible under today's lockdown and quarantine
restrictions, has led to a huge increase in demand for voice calls. This service therefore remains one
of the most developed and necessary for users, and hence relevant in roaming. Let us take a closer
look at the tasks of forecasting services and tariff packages that telecommunication companies face
when serving users in roaming. First, the problem of forecasting the amount of traffic in roaming
will be solved. Next, subscribers will be classified according to mobile Internet tariff packages in
roaming. As a result, it will be possible to determine which tariff packages are currently relevant
for subscribers and whether one or several different offers need to be developed.

2. Problem statement
   This work studies information tools that make it possible to predict the volume of services (calls,
traffic) used by a particular subscriber abroad on the basis of available statistical information.
Mathematical models need to be developed to analyze consumption and to forecast the services that
will be used abroad. It is necessary to predict the volume of mobile Internet traffic in roaming from
statistics on subscriber behavior and to classify subscribers in order to develop relevant packages
and offers. Finally, special techniques and approaches are needed to increase the number of
subscribers who use these services in roaming.

3. The main methods used in the research
    According to the seventh international workshop held by SAS in September 2020, "Real Time
Analytics & Cyber Security 2020", the "big four" of the most promising and relevant methods for
solving analytical problems in the financial sector were neural networks [11 – 13], gradient
boosting, random forest and support vector machines [14 – 16]. It was also noted that the logistic
regression method shows high results for classification. Based on this international experience, we
chose these methods to solve the classification problem in our tasks. For internet traffic
forecasting, the theory of time series analysis [8, 9, 17, 18] was applied, and several
autoregressive models were explored and built after preliminary data processing.

3.1.     Logistic regression
   Logistic regression is a statistical method used to analyze a data set consisting of one or many
independent characteristics that affect the outcome. The result is evaluated using a dichotomous
variable, which indicates with what probability the result belongs to a particular class [10]. The
logistic regression algorithm uses a linear equation (boundary function) with independent variables
to determine which class the data belongs to [4]. This equation describes the linear boundary that
separates the input data space.
   The boundary function can be written in general form as follows:
   $$ \mathrm{Score}(x_i) = w_0 + w_1 x_1 + \ldots + w_n x_n = w^{T} x_i, \tag{1} $$
where $x_i$ is the vector of input features, $w_0$ is the decision-making threshold and $w_1, \ldots, w_n$ is the vector of weights.
    To obtain a class membership probability in [0; 1] from the boundary function, the following
logistic function is used:
   $$ \mathrm{logit}(\mathrm{Score}) = \frac{1}{1 + e^{-\mathrm{Score}}}. \tag{2} $$
    First, to construct the boundary function it is necessary to find the coefficients $w_1, \ldots, w_n$.
To do this, a training sample is defined, consisting of the independent variables (characteristics)
and the corresponding values of the dependent variable $y$ (the outcome). Formally, it is a set of
pairs $(x^{(1)}; y^{(1)}), \ldots, (x^{(m)}; y^{(m)})$, where $x^{(i)} \in R^{n}$ is the vector of values of the
independent variables and $y^{(i)} \in \{0, 1\}$ is the corresponding value of $y$. Each such pair is called
a training example. Usually the maximum likelihood method is used: the parameters are chosen to
maximize the value of the likelihood function on the training sample [4]:
   $$ W = \arg\max_{w} L(W) = \arg\max_{w} \prod_{i=1}^{m} P\{ y = y^{(i)} \mid x = x^{(i)} \}. \tag{3} $$
   Maximizing the likelihood function is equivalent to maximizing its logarithm:
   $$ \log L(W) = \sum_{i=1}^{m} \log P\{ y = y^{(i)} \mid x = x^{(i)} \} = \sum_{i=1}^{m} \left[ y^{(i)} \log f(w^{T} x^{(i)}) + (1 - y^{(i)}) \log\left(1 - f(w^{T} x^{(i)})\right) \right], \tag{4} $$
where $w^{T} x^{(i)} = w_0 + w_1 x_1 + \ldots + w_n x_n$ and $f$ is the logistic function (2).
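    To maximize this function, gradient ascent (that is, gradient descent applied to the negative
log-likelihood) can be used. Starting from some initial value $w_0$, the maximum can be found
iteratively [4]:
   $$ w := w_0 + \alpha \nabla \log L(w) = w_0 + \alpha \sum_{i=1}^{m} \left( y^{(i)} - f(w^{T} x^{(i)}) \right) x^{(i)}, \quad \alpha > 0. \tag{5} $$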
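The update (5) can be illustrated with a minimal sketch in plain Python. The step size, the number of iterations, the toy data and all function names below are assumptions made for illustration; they are not taken from the study.

```python
import math

def logistic(score):
    """Logistic function (2): maps a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def fit_logistic(xs, ys, step=0.05, n_iter=5000):
    """Gradient ascent on the log-likelihood (4) with the update rule (5).

    xs: list of feature vectors, ys: list of 0/1 labels.
    Returns [w0, w1, ..., wn], where w0 plays the role of the threshold.
    """
    n = len(xs[0])
    w = [0.0] * (n + 1)                      # assumed starting point
    for _ in range(n_iter):
        grad = [0.0] * (n + 1)
        for x, y in zip(xs, ys):
            xa = [1.0] + list(x)             # constant 1 so that w0 enters the score
            err = y - logistic(sum(wj * xj for wj, xj in zip(w, xa)))
            for j in range(n + 1):
                grad[j] += err * xa[j]       # gradient term (y - f(w^T x)) x
        w = [wj + step * gj for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Probability of class 1 for feature vector x."""
    return logistic(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Toy, deliberately non-separable data (hypothetical).
xs = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
ys = [0, 0, 1, 0, 1, 1]
w = fit_logistic(xs, ys)
```

With these toy data the fitted probability grows with the feature value, so `predict_proba(w, [0.5])` falls below 0.5 and `predict_proba(w, [4.0])` above it.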


3.2.      Neural networks
   Most modern neural networks are built from formal neurons that resemble their biological
prototype. A neuron is described by the following elements: $x_1, \ldots, x_n$ are the values fed to the
inputs (synapses) of the neuron; $w_1, \ldots, w_n$ are the weighting coefficients of the synapses, which
can have either an inhibiting or an amplifying effect; $S$ is the weighted sum of the inputs:
   $$ S = \sum_{i=1}^{n} w_i x_i - T, \tag{6} $$
$T$ is the neuron threshold (omitted in many models), and $F$ is the neuron activation function, which
converts the weighted sum into the output signal $y = F(S)$ [20].
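A formal neuron of this kind can be sketched in a few lines. The step activation and the example weights are assumptions chosen only to make the sketch concrete, and the threshold is taken as subtracted from the weighted sum.

```python
def step_activation(s):
    """A simple threshold activation F: fires (1) when S is non-negative."""
    return 1 if s >= 0 else 0

def neuron_output(x, w, threshold, activation=step_activation):
    """Compute y = F(S) with S = sum_i w_i * x_i - T, as in (6)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - threshold
    return activation(s)

# Hypothetical weights that make the neuron behave like a logical AND of two inputs.
and_weights = [0.6, 0.6]
and_threshold = 1.0
```

For example, `neuron_output([1, 1], and_weights, and_threshold)` gives 1, while any input containing a zero gives 0.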
    The neurons are organized into layers; the elements of a layer are connected only to the neurons
of the previous layer, and information propagates from the earlier layers to the later ones. The
input layer consists of sensory S-elements, which receive the input signals $X_i$; it does not process
information and only distributes the signals. Each S-element is connected to a set of associative
elements (A-elements) of the first intermediate layer, and the A-elements of the last layer are
connected to the reacting elements (R-elements) [20].
    Weighted combinations of the R-element outputs determine the system response, which indicates
that the evaluated object belongs to a certain class (pattern). If only two classes are recognized,
the perceptron has a single R-element with two reactions, positive and negative. If there are more
than two classes, a separate R-element is assigned to each class, and the output of each such element
is a linear combination of the A-element outputs.
    Neural networks are now among the most widely used methods and continue to be developed and
adapted, for example as recurrent neural networks (RNN), long short-term memory networks (LSTM)
and generalized regression neural networks (GRNN) for regression problems. Therefore, the use of
neural networks is appropriate for our problems.

3.3.      Random Forest
    The random forest method is based on constructing a large number (an ensemble) of decision trees
(this number is a parameter of the method), each of which is built on a sample obtained from the
original training sample by bootstrap (i.e. sampling with replacement), in contrast to classical
decision tree construction algorithms [15].
    The bootstrap procedure repeatedly draws random samples from the empirical distribution.
Specifically, given an initial sample of $n$ elements $x_1, x_2, \ldots, x_{n-1}, x_n$, a random number
generator uniformly distributed on the interval $[1, n]$ is used to extract an arbitrary element $x_k$,
which is then returned to the original sample for possible re-extraction. This procedure is repeated
$n$ times. The result is a bootstrap sample in which some elements may occur two or more times,
while other elements are absent. For example, for $n = 6$ one such bootstrap combination has the
following form: $x_1, x_2, x_2, x_1, x_4, x_5$ [21].
    Bootstrap samples are drawn uniformly and with replacement, so some initial observations will be
missing while others will be duplicated: on average, one such sample contains about 2/3 of the unique
initial observations. The bootstrap is particularly useful for forming model ensembles, especially in
combination with tree-like structures, which are very sensitive to small changes in the training data [6].
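The "about 2/3" figure can be checked numerically. The following sketch draws bootstrap samples from a synthetic sample of 1000 elements (the sample size, the number of repetitions and the seed are arbitrary assumptions) and measures the share of distinct original observations; the theoretical share is $1 - (1 - 1/n)^n$, which is close to $1 - 1/e \approx 0.632$ for large $n$.

```python
import random

def bootstrap_sample(sample, rng):
    """Draw len(sample) elements uniformly with replacement."""
    n = len(sample)
    return [sample[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)               # fixed seed so the sketch is reproducible
data = list(range(1000))             # synthetic stand-in for the training sample

unique_shares = []
for _ in range(200):
    drawn = bootstrap_sample(data, rng)
    unique_shares.append(len(set(drawn)) / len(data))

mean_unique_share = sum(unique_shares) / len(unique_shares)
```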
    Since averaging several observations reduces the variance of the estimate, a natural way to
reduce the variance of a forecast is to draw a large number of training samples from the general
population, build a predictive model on each of them and average the obtained forecasts. If, instead
of separate training samples, the bootstrap is performed and $B$ regression trees are built on the
generated pseudo-samples, the averaged collective forecast will have a lower variance:
   $$ \hat{f}_{bag}(x) = \frac{f_1(x) + f_2(x) + \ldots + f_B(x)}{B}. \tag{7} $$
    This procedure is called bagging (short for bootstrap aggregating). Bagging can be applied not
only to regression trees but also to other models [10].
    The random forest is an improvement of decision tree bagging that aims to reduce the correlation
between trees. As with bagging, several hundred decision trees are built on bootstrap training
samples. However, at each split during tree construction, $m$ of the $p$ predictors are randomly
selected as candidates, and the split may use only one of these $m$ variables [15]. This procedure is
quite effective for improving the quality of the obtained solutions: with probability $(p - m)/p$, any
potentially dominant predictor that would otherwise enter every tree is blocked. If such predictors
were allowed to dominate, all the resulting trees would be very similar to each other, the forecasts
based on them would be strongly correlated, and the decrease in variance would be less pronounced.
Blocking the dominant predictors gives other predictors a chance, and the variation between trees
increases. Choosing a small value of $m$ when constructing a random forest is useful when there is a
large number of correlated predictors. Naturally, if a random forest is built with $m = p$, the whole
procedure reduces to simple bagging [20].
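The blocking probability $(p - m)/p$ follows from drawing $m$ candidates out of $p$ predictors without replacement. A small simulation can confirm it; the values $p = 10$, $m = 3$, the number of trials and the seed are assumptions chosen for illustration.

```python
import random

def exclusion_rate(p, m, n_trials, rng):
    """Share of splits in which predictor 0 is not among the m candidates."""
    excluded = 0
    for _ in range(n_trials):
        candidates = rng.sample(range(p), m)   # m distinct predictors out of p
        if 0 not in candidates:
            excluded += 1
    return excluded / n_trials

# For p = 10 and m = 3 the theoretical value is (10 - 3) / 10 = 0.7.
rate = exclusion_rate(p=10, m=3, n_trials=20000, rng=random.Random(1))
```

Setting `m` equal to `p` makes the exclusion rate zero, which corresponds to the reduction to plain bagging.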
    Random forests provide a significant increase in accuracy because the trees in the ensemble are
weakly correlated owing to the double injection of randomness into the inductive algorithm: bagging
and the random subspace method for splitting each node. They are also largely resistant to
overfitting. They are easy to use: the only parameters of the algorithm are the number of trees in
the ensemble and the number of features randomly selected for splitting at each node of the tree.

3.4.      Gradient Boosting
   Now consider the problem of recognizing objects from a multidimensional space $X$ with label
space $Y$. Let a training sample $\{x_i\}_{i=1}^{N}$, where $x_i \in X$, be given, and let the true labels
$\{y_i\}_{i=1}^{N}$, where $y_i \in Y$, be known for each object. It is necessary to build a recognition
operator that predicts the label of each new object $x \in X$ as accurately as possible. Let a family
of base algorithms $H$ be given, each element $h(x; a) \in H : X \to R$ of which is determined by some
vector of parameters $a \in A$ [7].
   We will search for the final classification algorithm in the form of the composition
   $$ F_M(x) = \sum_{m=1}^{M} b_m h(x; a_m), \quad b_m \in R, \; a_m \in A. \tag{8} $$
   However, selecting the optimal set of parameters $\{a_m, b_m\}_{m=1}^{M}$ is a very time-consuming
task. Therefore, we build the composition greedily, each time adding to the sum the term with the
best parameters among all possible ones. Assume that a classifier $F_{m-1}$ of length $m - 1$ has
already been constructed. The problem is then to find the pair of optimal parameters $\{a_m, b_m\}$
for the classifier of length $m$:
   $$ F_m(x) = F_{m-1}(x) + b_m h(x; a_m), \quad b_m \in R, \; a_m \in A. \tag{9} $$
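The greedy construction (9) can be sketched as follows. This is a simplified illustration under stated assumptions, not the model used in the study: the base algorithms $h$ are one-split regression trees (stumps) on a single feature, the loss is the squared error (so each stump is fitted to the current residuals), and $b_m$ is found by a closed-form line search. All names and the toy data are hypothetical.

```python
def fit_stump(xs, targets):
    """Least-squares stump on 1-D inputs; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:           # skip the largest value: right side must be non-empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return None if best is None else best[1:]

def stump_predict(stump, x):
    t, ml, mr = stump
    return ml if x <= t else mr

def boost(xs, ys, n_stages):
    """Greedy stage-wise fit: F_m = F_{m-1} + b_m * h(x; a_m)."""
    f0 = sum(ys) / len(ys)                   # constant initial model
    preds = [f0] * len(xs)
    stages = []
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)     # parameters a_m
        if stump is None:
            break
        hs = [stump_predict(stump, x) for x in xs]
        denom = sum(h * h for h in hs)
        if denom == 0:
            break
        b = sum(r * h for r, h in zip(residuals, hs)) / denom   # coefficient b_m
        stages.append((b, stump))
        preds = [p + b * h for p, h in zip(preds, hs)]
    return f0, stages

def predict(model, x):
    f0, stages = model
    return f0 + sum(b * stump_predict(s, x) for b, s in stages)

def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]                     # toy target
short_model = boost(xs, ys, 1)
long_model = boost(xs, ys, 20)
```

On the training data the error of `long_model` is lower than that of `short_model`, which illustrates how each added term refines the composition.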
   The idea of boosting can also be applied to classification. In the case of binary classification
this means $Y = \{-1; 1\}$. It is then often assumed that each algorithm $h \in H$ returns a real-valued
«degree» to which the object belongs to a certain class, and the resulting answer $\tilde{F}$ is obtained
by applying a threshold rule to the composition [16].

3.4.1. Multiclass classification
    The idea of boosting for binary classification can be easily generalized to the case of $K$ classes
[22]. The following loss function is introduced:
   $$ L(y, F) = -\sum_{i=1}^{K} y_i \log p_i(x). \tag{10} $$

Here $y_i \in \{0, 1\}$ indicates whether the object belongs to class $i$, and $p_i$ is the probability that
the object belongs to class $i$, obtained by applying logistic regression. The formula for the $k$-th
classifier of multiclass logistic regression is:
   $$ f_k(x) = \log p_k(x) - \frac{1}{K} \sum_{i=1}^{K} \log p_i(x). \tag{11} $$
After transformations one obtains:
   $$ Q_{ik} = y_{ik} - p_{k,m-1}(x_i). \tag{12} $$
If the problem is too complicated computationally, then in the case of decision trees the first step
of the Newton-Raphson algorithm can be used as an approximation:
   $$ c_{jm} = \frac{K - 1}{K} \cdot \frac{\sum_{x_i \in R_{jkm}} Q_{ik}}{\sum_{x_i \in R_{jkm}} |Q_{ik}| (1 - |Q_{ik}|)}. \tag{13} $$

The classifier searches for the object class for which the expected cost of assigning the object
wrongly, given the probabilities of the other classes, is minimal:
   $$ c_{jm} = \arg\min_{k \in [1, K]} \sum_{\tilde{k}=1}^{K} c(k, \tilde{k}) \, p_{\tilde{k} m}(x). \tag{14} $$
In this formula, $p_{\tilde{k} m}(x)$ denotes the probability of belonging to class $\tilde{k}$ as a result of the
$m$-th iteration of the boosting algorithm. The value $c(k, \tilde{k})$ denotes the error cost incurred if it is
assumed that the object belongs to class $k$, although in fact it belongs to class $\tilde{k}$ [16].
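The quantities in (12) and (14) can be illustrated with a short sketch. It assumes the usual softmax link between the class scores and the class probabilities, which the text does not state explicitly; the function names, scores and cost matrix are hypothetical.

```python
import math

def class_probabilities(scores):
    """Softmax over class scores f_1..f_K (assumed link, see lead-in)."""
    top = max(scores)                         # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_residuals(y_onehot, probs):
    """Q_ik = y_ik - p_k(x_i) for one object, as in (12)."""
    return [y - p for y, p in zip(y_onehot, probs)]

def min_cost_class(probs, cost):
    """Decision rule (14): cost[k][j] is the cost of predicting k when the truth is j."""
    n_classes = len(probs)
    expected = [sum(cost[k][j] * probs[j] for j in range(n_classes))
                for k in range(n_classes)]
    return min(range(n_classes), key=lambda k: expected[k])

probs = class_probabilities([2.0, 1.0, 0.0])
residuals = pseudo_residuals([0, 1, 0], probs)

# Asymmetric costs: predicting class 0 for a true class-1 object is 10 times worse.
asymmetric_cost = [[0, 10], [1, 0]]
decision = min_cost_class([0.6, 0.4], asymmetric_cost)
```

With equal costs for all errors the rule reduces to choosing the most probable class; with the asymmetric costs above, class 1 is chosen even though class 0 is more probable (expected costs 4.0 against 0.6).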

4. Input data features and characteristics description
   To analyze and predict the behavior of subscribers who use roaming services, the mobile operator
provided a sample of 120,000 records with data on subscribers traveling abroad: 10,000 random
records for each month of the period from August 2017 to July 2018, i.e. for twelve consecutive
months. It is necessary to predict, first of all, whether a subscriber traveling abroad will use
communication services, and also to determine which services (calls or mobile internet) the
subscriber will use.
   The input data contain the following characteristics: the month preceding the trip abroad (used
to reflect the history of service usage in Ukraine), and a set of characteristics describing the
features and types of services used by the subscriber in Ukraine (the subscriber's internal tariff
plan, the number of call minutes, the number of short messages, the amount of GPRS traffic, the
amount spent by the subscriber on services in Ukraine, the amount and number of balance top-ups in
the specified month, and the administrative region of Ukraine in which the subscriber stayed for
more than 90 days before departure abroad). We also added characteristics for our task: the month in
which the subscriber went abroad, the code of the country tariff group and the name of the
destination country.
   The following characteristics describe the amount of services that the customer used in roaming
during the previous trip abroad. The mobile operator stores such information for the entire period
of customer service, but for our task it is advisable to take into account only previous visits no
earlier than 2016. The reason is the significant change in tariffs: before 2016 roaming services were
very expensive and were therefore used very infrequently. This differs significantly from today's
situation, in which the mobile operator faces the task of determining subscriber behavior and the
relevant services under current conditions.
         HISTORY_ROAM_GPRS is the amount of GPRS traffic in roaming.
         HISTORY_ROAM_MINS is the amount of call minutes in roaming.
         HISTORY_ROAM_SMS is the number of short messages in roaming.
   If the subscriber has not traveled abroad in the last three years, the values of these
characteristics are empty. Empty values are further processed and filled in by one of the methods
for recovering incomplete data (the average or median value for the tariff group of countries)
[23].
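The median variant of this filling step can be sketched as follows. The field name `TARIFF_GROUP` and the sample records are hypothetical; only `HISTORY_ROAM_GPRS` is taken from the description above, and the function is an illustration rather than the procedure actually used.

```python
from statistics import median

def fill_missing_by_group(records, value_key, group_key):
    """Replace missing (None) values with the median of the same group."""
    observed = {}
    for rec in records:
        if rec[value_key] is not None:
            observed.setdefault(rec[group_key], []).append(rec[value_key])
    medians = {group: median(values) for group, values in observed.items()}

    filled = []
    for rec in records:
        copy = dict(rec)                      # leave the input records untouched
        if copy[value_key] is None:
            # Stays None when the whole group has no observed values.
            copy[value_key] = medians.get(copy[group_key])
        filled.append(copy)
    return filled

# Hypothetical records: traffic in MB and a tariff group code.
records = [
    {"TARIFF_GROUP": "A", "HISTORY_ROAM_GPRS": 10.0},
    {"TARIFF_GROUP": "A", "HISTORY_ROAM_GPRS": 30.0},
    {"TARIFF_GROUP": "A", "HISTORY_ROAM_GPRS": None},
    {"TARIFF_GROUP": "B", "HISTORY_ROAM_GPRS": None},
]
filled = fill_missing_by_group(records, "HISTORY_ROAM_GPRS", "TARIFF_GROUP")
```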
   In total the sample contains 15 characteristics that describe subscriber behavior in Ukraine and
abroad. An experimental study was conducted, and the following generalized characteristics of
behavior in Ukraine were selected for further modeling:
         MO_UKR is the sum of outgoing calls made inside the network to other operators and
   outgoing calls abroad.
         SMS_UKR is the sum of outgoing short messages in the network sent to the phone
   numbers of the national operators and international short messages.
         AMOUNT is the total amount of all service usage across calls, GPRS and short messages.
5. Data mining methods application to solve the classification and
   forecasting problems
5.1. The task of predicting the services amount that will be used by the
subscriber in roaming
    Based on subscriber statistics, it is necessary to forecast and offer the departing subscriber an
appropriate roaming service package. It is also necessary to predict the amount of mobile GPRS
traffic that the subscriber will use in roaming, the target variable $Y\_mi$. It is advisable to
approach this problem with different types of regression models [5, 8, 18].
    To forecast mobile data usage in roaming, a dataset for the period from January 2016 to July
2018 was used, i.e. data for 31 consecutive months. The time series of mobile internet usage in this
period is shown in Figure 1.
Figure 1: The volume of GPRS mobile data traffic, MB

    The usage of the GPRS service abroad grows throughout this period owing to a significant
reduction in the prices of the service for subscribers and international partners. Visually, the time
series shown in Figure 1 has an exponential trend, but we nevertheless first carried out a series of
stationarity checks: we constructed the moving average and standard deviation and performed the
Dickey-Fuller and KPSS tests, which indicated that the time series is not stationary (Dickey-Fuller:
p-value = 1.00, KPSS criterion: p-value = 0.05) [24, 25].
    To transform the original series of mobile data usage in roaming into a stationary one, the
exponential trend has to be removed by taking second differences. After this transformation the
time series became stationary, so autoregressive (AR) models and autoregressive moving average
(ARMA) models could be used. To build these models, the autoregressive order and the moving
average order have to be found.
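Differencing itself is a simple operation, sketched below. The helper name and the toy series are assumptions for illustration; the sketch only shows the transformation, not the stationarity tests applied in the study. The toy series has a quadratic trend, which second differences remove exactly.

```python
def difference(series, order=1):
    """Apply first differences `order` times: d_t = s_t - s_{t-1}."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# Toy series with a quadratic trend (hypothetical values, not the operator's data).
toy_series = [1, 4, 9, 16, 25]
second_diff = difference(toy_series, order=2)
```

Each differencing step shortens the series by one observation, so a series of 31 monthly values leaves 29 values after taking second differences.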
    Parameter selection for the models $AR(p)$, $MA(q)$ and $ARMA(p, q)$ of the series is based on the
partial autocorrelation functions, which are used to determine the lag and to search for the
parameters $p$ and $q$. The results of the comparative analysis of the estimated statistical
characteristics of the selected models are presented in Table 1.

Table 1
Comparison of the models which describe the process of using mobile data traffic in roaming
    Model              R^2       AIC       BSC       DW
    AR(2)              0.95      843       847       1.97
    MA(6)              0.95      699       709       1.76
    ARMA(2, 6)         0.98      676       689       2.12

    The ARMA(2, 6) model was identified as the best one. Figure 2 shows the simulation results of
ARMA(2, 6) together with the input data values of the series. Based on this model, a forecast for
the next 7 months was built and provided to the mobile operator for planning its further tariff
strategy.
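The DW column of Table 1 is the Durbin-Watson statistic of the model residuals; values close to 2 indicate no first-order autocorrelation in the residuals. A minimal sketch of its computation is given below; the residual values are made up for illustration and are not the residuals of the models above.

```python
def durbin_watson(residuals):
    """DW = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Strongly alternating residuals push the statistic towards 4,
# strongly persistent residuals push it towards 0.
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
persistent = durbin_watson([1.0, 1.0, 1.0])
```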
Figure 2: The results of the ARMA(2, 6) model in comparison with the original series

5.2. Classification models construction to determine the best package of
services for roaming subscribers
    To develop new tariffs and service packages, a sample of 120,000 records was provided,
containing data on subscribers who stayed abroad in the period from August 2017 to July 2018. The
task was to predict the target variable, the package of mobile data transmission services in
roaming, with classification models.
    Let us first define the main service packages for GPRS mobile internet in roaming. The
target variable $Y\_p\_mi$ for predicting GPRS traffic in roaming can take the following values:
          gr_0 – the subscriber did not use the service at all;
          gr_100 – the subscriber used up to 100 MB;
          gr_500 – the subscriber used from 100 to 500 MB;
          gr_over_500 – the subscriber used more than 500 MB.
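The mapping from the traffic volume to these four classes can be sketched as a small function. The handling of the exact boundary values (100 MB and 500 MB are assigned to the lower class) is an assumption, since the text does not specify it, and the function name is hypothetical.

```python
def traffic_package(traffic_mb):
    """Map the GPRS traffic used in roaming (in MB) to a package class label."""
    if traffic_mb <= 0:
        return "gr_0"            # the service was not used at all
    if traffic_mb <= 100:
        return "gr_100"          # up to 100 MB (boundary handling assumed)
    if traffic_mb <= 500:
        return "gr_500"          # from 100 to 500 MB (boundary handling assumed)
    return "gr_over_500"         # more than 500 MB

labels = [traffic_package(v) for v in (0, 35.5, 250, 1200)]
```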
    It is known that subscribers traveling abroad often choose not to use mobile services at all.
For calls in roaming this group amounts to about 66% of subscribers, and for GPRS services to 42%
(in 50,623 trips subscribers did not use any of the internet service packages). Figure 3 shows the
distribution of classes for the different mobile Internet packages in roaming.


Figure 3: The frequency of the usage classes of mobile data transmission in roaming
   The data on the above-mentioned 15 characteristics were produced by the corresponding
technical systems of the mobile operator. The numerical values are accurate, and the text data take
a limited, finite number of values and can be used directly as categorical features.

5.2.1. Data processing and preparation
   For continuous variables in the input dataset, discretization was used, i.e. splitting a
continuous variable into several categories [25]. To convert the categorical features represented by
string literals, direct coding, also called one-hot encoding (coding with one active state), was
used [4]. The basic idea is to replace a categorical variable with new binary variables. In this way
one new feature is created for each category of a categorical feature, so the number of new features
equals the number of categories. Each new feature is a binary indicator of a certain category.
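One-hot encoding can be sketched in plain Python as follows. The function name and the example country codes are hypothetical; the alphabetical ordering of the categories is an implementation choice of the sketch.

```python
def one_hot(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Hypothetical destination-country codes.
categories, encoded = one_hot(["UA", "PL", "UA", "DE"])
```

Here `categories` is `["DE", "PL", "UA"]`, and each encoded row contains exactly one 1, in the position of the row's category.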
   After preliminary data processing, the initial sample was divided into training (75%) and test
(25%) data sets. Since the sample is unbalanced, class stratification was used to ensure that a
sufficient number of records was obtained from each group.
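A stratified 75/25 split can be sketched as below: the split is made separately inside each class, so the class proportions are preserved in both parts. The function name, the seed and the toy labels are assumptions for illustration.

```python
import random

def stratified_split(labels, test_share, rng):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # random order inside the class
        n_test = round(len(indices) * test_share)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# Toy unbalanced labels (hypothetical proportions).
labels = ["gr_0"] * 80 + ["gr_100"] * 20
train_idx, test_idx = stratified_split(labels, test_share=0.25, rng=random.Random(0))
```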
   At the next step the sample was balanced by increasing the number of records of the smaller
classes to the number of records of the majority class, thereby increasing their weight in the
training sample. A modified over-sampling method, SMOTE (Synthetic Minority Over-sampling
Technique) [19], was used, in which records are not simply duplicated but are artificially generated
from examples of real representatives of the class with minor deviations.
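The core idea of SMOTE, generating a synthetic record on the segment between a real minority record and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration, not the modified variant used in the study; the function name, the parameter values and the toy points are assumptions.

```python
import random

def smote_like(minority, n_new, k, rng):
    """Generate n_new synthetic points from a list of minority-class points.

    Requires at least two minority points.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()                    # position on the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Toy minority class: the corners of the unit square.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority_points, n_new=10, k=2, rng=random.Random(0))
```

Every synthetic point lies between two real minority points, so the new records stay inside the region already occupied by the class.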

5.2.2. The statistical characteristics for checking the classification quality
used in our task
   The best-known statistical indicators for assessing classification quality are overall accuracy,
the error matrix (confusion matrix), errors of the first and second kind, and the Gini index [25].
The error matrix is usually defined for binary classification and has size $2 \times 2$. In general, the
error matrix has size $N \times N$, where $N$ is the number of classes, and shows both the correct model
predictions and the forecast errors. For our four-class classification problem the matrix is
presented in Table 2. The classification errors are divided into the following groups:
   FN (False Negative) – errors of the first kind, false negative values. For our task such an error
means that the forecast implies a smaller package of services than is actually needed.
   FP (False Positive) – errors of the second kind, false positive values. Here this error means that
the forecast implies a larger package of services than is actually needed [22].

Table 2
Confusion matrix for mobile internet in roaming
                          Forecasted values
    Real values     ŷ = 1     ŷ = 2     ŷ = 3     ŷ = 4
    y = 1           TP        FP        FP        FP
    y = 2           FN        TP        FP        FP
    y = 3           FN        FN        TP        FP
    y = 4           FN        FN        FN        TP

    A peculiarity of the task of classifying service packages is that errors of the second kind are
more acceptable for marketing decisions than errors of the first kind: errors of the first kind mean
potentially unearned income, whereas errors of the second kind mean only a failed attempt to sell a
package of services.
    The values of the confusion matrix are used to calculate the main metrics for evaluating the
classification model.
   The overall accuracy of the model is calculated as [26]:
$$ \text{accuracy} = \frac{TP}{TP + FP + FN}. \tag{15} $$
   For a model with classes of unequal size this metric does not fully characterize the model
correctness. For the classification of subscribers' behavior in roaming this means that if 70% of
subscribers do not use roaming services, then classifying all instances as class 0 already yields an
accuracy of 0.7. Obviously, such a result does not serve the purpose of the forecast, and the model
has no significant practical value.
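
   The accuracy of equation (15) and the imbalance effect described above can be checked with a
minimal sketch. It assumes the ordered layout of Table 2 (cells above the diagonal are counted as FP,
cells below it as FN); the 100-subscriber matrix and the function names are hypothetical and are not
taken from the study data.

```python
def count_outcomes(cm):
    """Split an N x N confusion matrix (rows: real class, columns: forecast)
    into TP, FP and FN following the layout of Table 2."""
    n = len(cm)
    tp = sum(cm[i][i] for i in range(n))
    # forecast class above the real one: package greater than needed
    fp = sum(cm[i][j] for i in range(n) for j in range(n) if j > i)
    # forecast class below the real one: package smaller than needed
    fn = sum(cm[i][j] for i in range(n) for j in range(n) if j < i)
    return tp, fp, fn


def accuracy(cm):
    """Overall accuracy as in equation (15)."""
    tp, fp, fn = count_outcomes(cm)
    return tp / (tp + fp + fn)


# Hypothetical sample of 100 subscribers, all forecast as the lowest class.
cm_all_lowest = [
    [70, 0, 0, 0],
    [15, 0, 0, 0],
    [10, 0, 0, 0],
    [5, 0, 0, 0],
]
tp, fp, fn = count_outcomes(cm_all_lowest)
acc = accuracy(cm_all_lowest)
```

   Here the accuracy equals 0.7 although all 30 subscribers who actually use roaming are
misclassified, and every one of these mistakes is an error of the first kind.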
   The quality of the multiclass classification model is assessed using the following metrics, which
are calculated for each class separately.
   Precision shows what share of the objects labelled as positive by the classifier are in fact
positive:
$$ \text{precision} = \frac{TP}{TP + FP}. \tag{16} $$
   Recall shows what share of all objects of the positive class were found by the algorithm:
$$ \text{recall} = \frac{TP}{TP + FN}. \tag{17} $$
   Recall demonstrates the ability of the algorithm to find a given class at all, and precision – the
ability to distinguish this class from other classes [26].
   F-measure (f1-score) is the harmonic mean of precision and recall:
$$ f_{\beta} = (1 + \beta^{2}) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^{2} \cdot \text{precision}) + \text{recall}}, \tag{18} $$
   where $\beta$ determines the relative weight of precision and recall in this metric. For $\beta = 1$
it is the harmonic mean; the factor $(1 + \beta^{2})$ then equals 2, so that the score equals one when
$\text{precision} = 1$ and $\text{recall} = 1$. The f1-score thus reaches its maximum when recall and
precision are both equal to one, and approaches zero if either of the two indicators approaches zero
[26].
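
   Equations (16)–(18) can be evaluated per class directly from an $N \times N$ confusion matrix. The
sketch below treats each class one-vs-rest, which is the usual convention for per-class metrics; the
function name and the example matrix are hypothetical and only illustrate the arithmetic.

```python
def per_class_metrics(cm, beta=1.0):
    """Return (precision, recall, f_beta) for every class of a confusion
    matrix whose rows are real classes and columns are forecast classes."""
    n = len(cm)
    result = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # other classes forecast as k
        fn = sum(cm[k]) - tp                       # class k forecast as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = beta ** 2 * precision + recall
        f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
        result.append((precision, recall, f_beta))
    return result


# Hypothetical confusion matrix for four ordered classes.
cm = [
    [50, 5, 3, 2],
    [8, 20, 4, 0],
    [2, 3, 12, 3],
    [1, 0, 4, 8],
]
metrics = per_class_metrics(cm)
```

   For the first class of this matrix precision is 50/61 and recall is 50/60, which gives an f1-score
of 100/121, or about 0.83.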
   The Matthews correlation coefficient (MCC), in contrast to the previously considered estimates,
takes into account all the values of the confusion matrix and is calculated by the following equation:
$$ MCC = \frac{c \cdot s - \sum_{k=1}^{K} p_k t_k}{\sqrt{\left( s^{2} - \sum_{k=1}^{K} p_k^{2} \right) \left( s^{2} - \sum_{k=1}^{K} t_k^{2} \right)}}, \tag{19} $$
where $t_k$ is the number of instances in class $k$; $p_k$ is the number of times the model predicted
class $k$; $c$ is the number of correctly classified instances; $s$ is the total number of instances.
   The MCC takes values from -1 to +1. A model with a score of +1 is considered ideal, while a model
with a score of -1 is considered very weak.
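
   Equation (19) translates directly into code. The sketch below is only an illustration: the function
name and the small test matrices are hypothetical, and returning 0 when the denominator vanishes is a
convention assumed here, not something stated in the paper.

```python
import math


def multiclass_mcc(cm):
    """Matthews correlation coefficient of equation (19) for a confusion
    matrix whose rows are real classes and columns are forecast classes."""
    n = len(cm)
    t = [sum(cm[k]) for k in range(n)]                       # t_k: instances in class k
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]  # p_k: forecasts of class k
    c = sum(cm[k][k] for k in range(n))                      # correctly classified
    s = sum(t)                                               # total number of instances
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt(
        (s ** 2 - sum(pk ** 2 for pk in p)) * (s ** 2 - sum(tk ** 2 for tk in t))
    )
    return numerator / denominator if denominator else 0.0


mcc_perfect = multiclass_mcc([[5, 0], [0, 5]])   # every instance classified correctly
mcc_constant = multiclass_mcc([[7, 0], [3, 0]])  # everything forecast as one class
mcc_inverted = multiclass_mcc([[0, 5], [5, 0]])  # every instance classified wrongly
```

   A classifier that assigns every instance to the majority class scores 0 here even though its accuracy
is 0.7, which is why the MCC is more informative than accuracy for unbalanced classes.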

6. Simulation results
   The simulation was performed using the following methods: logistic regression, neural networks,
random forest, and gradient boosting. Let us consider in more detail the results of applying these
methods to classify offers of the mobile data service in roaming. The results of each model on the
test sample are shown as confusion matrices in Figure 4.
Figure 4: Confusion matrices of the above-mentioned models for the classification of mobile data
services in roaming

   The GPRS service is gaining popularity among the subscribers of the national mobile operator not
only in Ukraine but also in roaming. When such services are offered, one should expect a better
response to these offers from subscribers even if they belong to classes predicted erroneously by the
model (FP errors). To select the best model we use the same indicators of classification quality.
The results for all models are gathered in Table 3.
   The final comparison of the classifiers was made using the Matthews coefficient: logistic
regression shows MCC = 0.36; Neural Network – MCC = 0.51; Random forest – MCC = 0.58; and
Gradient Boosting – MCC = 0.58. The results of the Gradient Boosting and Random forest models
are very similar in all respects. Only the f1-score of Gradient Boosting is a few percent better, so
this model can be considered the best one.
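
   One way to make the f1-score comparison concrete is to average the per-class f1-scores of Table 3
for each model. The paper does not state how the per-class scores were aggregated, so the unweighted
(macro) average used below is an assumption; the numbers themselves are copied from Table 3.

```python
# Per-class f1-scores from Table 3, in the order gr_0, gr_100, gr_500, gr_over500.
f1_by_model = {
    "Logistic regression": [0.73, 0.44, 0.36, 0.41],
    "Neural Network": [0.78, 0.64, 0.53, 0.47],
    "Random forest": [0.83, 0.69, 0.59, 0.48],
    "Gradient Boosting": [0.82, 0.66, 0.61, 0.56],
}

# Unweighted mean over the four classes (macro averaging, assumed here).
macro_f1 = {model: sum(scores) / len(scores) for model, scores in f1_by_model.items()}
best_model = max(macro_f1, key=macro_f1.get)
```

   Under this assumption Gradient Boosting averages about 0.66 against about 0.65 for Random forest,
with the gain coming from the two classes with the highest traffic.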

7. Discussion
   Based on the simulation results, several possible improvements to the formation of offers for
subscribers traveling abroad were proposed.

Table 3
Comparison of classification models for the use of GPRS in roaming
 Class         Method                  precision     recall     f1-score
 gr_0          Logistic regression       0.68         0.79        0.73
               Neural Network            0.78         0.78        0.78
               Random forest             0.78         0.90        0.83
               Gradient Boosting         0.76         0.89        0.82
 gr_100        Logistic regression       0.62         0.34        0.44
               Neural Network            0.66         0.61        0.64
               Random forest             0.72         0.67        0.69
               Gradient Boosting         0.74         0.60        0.66
 gr_500        Logistic regression       0.36         0.37        0.36
               Neural Network            0.50         0.56        0.53
               Random forest             0.62         0.56        0.59
               Gradient Boosting         0.63         0.60        0.61
 gr_over500    Logistic regression       0.29         0.69        0.41
               Neural Network            0.46         0.49        0.47
               Random forest             0.63         0.38        0.48
               Gradient Boosting         0.53         0.59        0.56

   The first scenario takes into account the current state of the roaming communications market and
aims to analyze customer preferences and offer appropriate service packages. Such offers may
provide some extra benefits for clients, for example:
   • subscribers whose behavior is classified into the group gr_100 are offered 150 megabytes in
     the second purchased package. This encourages the subscriber to use more than 100 megabytes
     and to order one more package;
   • subscribers classified by the model into the group gr_500 (the subscriber used from 100 to
     500 megabytes) are offered a 500-megabyte package with a 5% discount if it is purchased
     before a promotional date;
   • subscribers classified by the model into the group gr_over500 are offered a 1 GB service
     package instead.
   This will encourage subscribers who already use roaming services to use slightly more of them,
improve their experience, increase the number of people who use roaming services, and thus increase
the total amount of mobile traffic in roaming.
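
   The offer rules of the first scenario amount to a lookup from the forecast class to an offer. The
sketch below shows one possible encoding; the dictionary, the function name and the decision to return
no offer for an unlisted class such as gr_0 are assumptions made for illustration, not part of the
operator's system.

```python
# First-scenario offers keyed by the class forecast by the model.
OFFERS = {
    "gr_100": "150 MB in the second purchased package",
    "gr_500": "500 MB package with a 5% discount until the promotional date",
    "gr_over500": "1 GB service package",
}


def offer_for(predicted_group):
    """Return the first-scenario offer for a forecast class,
    or None when no offer is listed for that class."""
    return OFFERS.get(predicted_group)


example_offer = offer_for("gr_500")
```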
   The first scenario can be implemented using classification models. The best model, based on
Gradient Boosting, provides satisfactory precision and recall of classification, which gives
confidence that the target audience for the marketing campaign will be identified well.
   The second scenario involves a large marketing campaign aimed at reducing the percentage of
subscribers who do not use mobile services when traveling abroad. The underlying assumption is that
subscribers cannot assess their own need for services or estimate the costs, and therefore prefer not
to use the services at all, sometimes even turning off their phones.
   A successful offer of a fixed package of services, for example 100 megabytes at a specified cost
with a guarantee that the limit cannot be exceeded and no excessive costs will be incurred, can
positively affect the decisions of subscribers. The main task is to identify, among subscribers who
do not use roaming services, the characteristics of their behavior in Ukraine that indicate which
package should be offered and to whom. For this purpose the models were developed and trained
mainly on the behavior of subscribers who use roaming services.
   For the second scenario, offers similar to those of the first scenario can also be used. The
peculiarity of the second scenario is that the cost of the marketing campaign can be much higher and
the conversion much lower. The second scenario is designed for a long-term campaign.
   All the developed models help to form and determine the target audience for marketing offers with
good quality. The decision to provide special offers or to inform subscribers is made by the
marketing department depending on the available budget.

8. Conclusions
    Mobile operators analyze user behavior in order to develop new tariff packages and retain
existing subscribers among their active users. The Covid-19 crisis has shown that the modern world
adapts to new conditions at an extremely rapid pace and that telecommunications and Internet
services need to be supported and developed. After the borders between countries reopen and
international travel intensifies, which may take place in early 2022, the tourist and business flow
will return to the pre-crisis level. It is expected that within the year it will reach the level of
2018, so the simulation and forecasting of roaming traffic will allow mobile operators to develop
appropriate tariff policies for roaming subscribers.
    Of the four classification models built for predicting subscriber behavior in roaming, the model
based on gradient boosting was selected because it showed the highest recall and precision on our
input data. This model should be used for prudent marketing strategies. For bolder marketing
campaigns, a classification model that produces more errors of the second kind should be used. In
such a case the model predicts that the subscriber should be offered a tariff, but in fact he will not
use the service. Such mistakes make it possible to conduct a marketing campaign that activates
roaming subscribers who do not use the services. As the number of subscribers accustomed to using
the Internet grows, the chances of success of such a marketing campaign will increase day by day.
    In further research it is planned to analyze and forecast the clients' usage of other services in
roaming (calls and SMS) and, in cooperation with other departments of the mobile operator, to give
recommendations on the company's development and the distribution of various services in the
national and international directions.

9. References
[1] European Commission Press Release. Brussels, 17 February 2014. URL:
    http://europa.eu/rapid/press-release_IP-14-152_en.htm.
[2] N. V. Kuznietsova, Information Technologies for Clients’ Database Analysis and Behaviour
    Forecasting, in: CEUR Workshop Proceedings, 2017, pp. 56–62. URL:
    http://ceur-ws.org/Vol-2067/.
[3] M. Havrylovych, N. Kuznietsova, Survival analysis methods for churn prevention in
    telecommunications industry, in: CEUR Workshop Proceedings, 2020, pp. 47–58. URL:
    http://ceur-ws.org/Vol-2577/paper5.pdf.
[4] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning with
    Applications in R, Springer-Verlag, New York, 2013. doi: 10.1007/978-1-4614-7138-7.
[5] N. S. Papageorgiou, V. D. Radulescu, D. D. Repovs, Nonlinear Analysis – Theory and
    Methods, Springer, Cham, Switzerland, 2019.
[6] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32. URL:
    https://doi.org/10.1023/A:1010933404324.
[7] J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of
    Statistics 29(5) (2001) 1189–1232. URL:
    https://www.jstor.org/stable/2699986?origin=JSTOR-pdf&seq=1.
[8] T. M. Rassias, Applications of Nonlinear Analysis, Springer, Cham, Switzerland, 2018.
[9] J. Beran, Mathematical Foundations of Time Series Analysis, Springer, Cham, Switzerland,
    2017.
[10] V. K. Shitikov, S. E. Mastitsky, Classification, regression and other data mining
    algorithms using R (in Russian), 2017. URL:
    https://ranalytics.github.io/data-mining/index.html.
[11] N. V. Kuznietsova, M. Seebauer, S. Zabielin, Some methods for estimating financial risks
    in banking, IEEE 1st Conf. on System Analysis and Intelligent Computing, SAIC 2018
    (2018), pp. 271–274. URL: https://ieeexplore.ieee.org/document/8516873.
[12] S. Osovsky, Neural networks for information processing (translation in Russian by I. D.
    Rudinsky), Finance and statistics, Moscow, 2002.
[13] O. V. Gorokhovatsky, O. O. Peredriy, Multilayer perceptron as the primary instrument for
    image clustering (in Ukrainian), Registration, storage and data processing 18 (2016) 33–43.
[14] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20(3) (1995) 273–297.
[15] S. P. Chistyakov, Random forests: an overview (in Russian), Works of Karelian scientific
    center of RAS, 2013, Issue 1, pp. 117–136. URL:
    http://resources.krc.karelia.ru/transactions/doc/trudy2013/trudy_2013_1_117-136.pdf.
[16] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization,
    2008. URL: http://www.math.washington.edu/~tseng/papers/apgm.pdf.
[17] R. S. Tsay, Analysis of financial time series, John Wiley & Sons, Inc., New York, NY,
    2010.
[18] P. Bidyuk, A. Gozhyj, Y. Matsuki, N. Kuznetsova, I. Kalinina, Modeling and Forecasting
    Economic and Financial Processes Using Combined Adaptive Models, in: Babichev S.,
    Lytvynenko V., Wójcik W., Vyshemyrskaya S. (Eds.), Lecture Notes in Computational
    Intelligence and Decision Making, ISDMCI 2020, Advances in Intelligent Systems and
    Computing, vol 1246, Springer, Cham, 2021. URL:
    https://doi.org/10.1007/978-3-030-54215-3_25.
[19] V. N. Nikulin, I. S. Kanishchev, I. V. Bagaev, Methods of balancing and normalization of
    data to improve the quality of classification, Computer tools in education 3 (2016) 16–24.
[20] V. K. Shitikov, G. S. Rosenberg, T. D. Zinchenko, Quantitative hydroecology: methods of
    systemic identification, IEVB RAS, Togliatti, 2003.
[21] J. H. Friedman, On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality, Data
    Mining and Knowledge Discovery 1 (1997) 55–77. URL:
    https://doi.org/10.1023/A:1009778005914.
[22] F. Herrera, F. Charte, A. J. Rivera, M. J. del Jesus, Multilabel Classification Problem
    Analysis, Metrics and Techniques, Springer International Publishing, Switzerland, 2016.
    doi: 10.1007/978-3-319-41111-8.
[23] N. V. Kuznietsova, Analytical Technologies for Clients’ Preferences Analyzing with
    Incomplete Data Recovering, in: CEUR Workshop Proceedings, 2018, pp. 118–128. URL:
    http://ceur-ws.org/Vol-2318/.
[24] D. Kwiatkowski, P. Phillips, P. Schmidt, Y. Shin, Testing the null hypothesis of
    stationarity against the alternative of a unit root, Journal of Econometrics 54 (1992)
    159–178.
[25] N. V. Kuznietsova, P. I. Bidyuk, Theory and practice of financial risk analysis: systemic
    approach, Lira-K, Kyiv, 2020.
[26] M. Hossin, M. N. Sulaiman, A Review on Evaluation Metrics for Data Classification
    Evaluations, International Journal of Data Mining & Knowledge Management Process 5
    (2015) 1–11. doi: 10.5121/ijdkp.2015.5201.