=Paper=
{{Paper
|id=Vol-2318/paper10
|storemode=property
|title=Analytical Technologies for Clients’ Preferences Analyzing with Incomplete Data Recovering
|pdfUrl=https://ceur-ws.org/Vol-2318/paper10.pdf
|volume=Vol-2318
|authors=Nataliia Kuznietsova
|dblpUrl=https://dblp.org/rec/conf/its2/Kuznietsova18
}}
==Analytical Technologies for Clients’ Preferences Analyzing with Incomplete Data Recovering==
Nataliia Kuznietsova

Institute for Applied System Analysis of National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, Ukraine
natalia-kpi@ukr.net

Abstract. The paper is devoted to new analytical information technologies for predicting clients' preferences. The problem of forecasting clients' preferences is now relevant for many commercial systems, companies, banks, insurance companies and e-commerce. Various marketing efforts are used to increase demand and attract new customers. The main idea is to understand customer needs and preferences and to model customer behavior with analytical technologies. Such technologies make it possible to analyze clients' data, evaluate customer demand and predict the next purchases. Modern approaches to clients' preferences prediction were analyzed, and collaborative filtering methods were chosen. The modelling task was formulated in terms of clients as subjects and purchases as objects. The method for recovering incomplete and missing data proposed by the author consists of the following stages: evaluation of sample incompleteness, analysis of whether the gaps are systematic, analysis of the gaps' causes and effects using a Bayesian network, and regression modelling for gap recovery. The implicit feedback method combined with the method for processing incomplete and missing data was built into an existing modern ERP system and makes it possible to achieve higher accuracy of clients' preferences prediction.

Keywords: Implicit feedback, Missed data, Collaborative filtering, Data recovering, Clients' preferences.

1 Introduction

The economic growth of any country and the further development of the economy are accompanied by increasing incomes and profits for companies and for the country's residents at the same time. There is also a significant development of the customer service sphere and an increasing demand for different products. Client-oriented strategies become the priority for customer service companies, and corporations engage in intensive competition for clients. The key is understanding customer needs and preferences [1, 2] and modeling consumer behavior; therefore, most companies and corporations invest heavily in developing their own solutions or purchasing existing ones [3, 4]. Such techniques allow them to analyze customer preferences and previous orders and to develop models that identify products interesting for customers, which can be offered to them for the next order.

2 Clients' Preferences Analysis

A foreign company specializing in the sale of media products (CDs, DVDs) as well as elite segment products has developed its own Enterprise Resources Planning (ERP) system for collecting statistical information about online store customers, catalogs of goods and knowledge bases built on previous experience. The knowledge is stored in the form of recommendations of related products and new acquisitions, based on the goods previously purchased in the online store. The company has more than 2 million unique customers and statistics on more than 5 million orders. Various marketing efforts, such as sending emails, are used to increase demand among its users and to attract new customers. Recommendations for related products are formed by an analyst who selects goods using his own algorithm.
Such recommendations are not very precise, since people are not able to process such volumes of information. The analyst uses the company's ERP system as an automated workplace. The analyst's recommendations are stored in the company's knowledge base, together with information on whether the client used the recommendation. The company database includes user information, product information and the date of purchase. The database contains missing and lost data.

The following modelling tasks are relevant for the company:

- Forecasting: sales estimation, forecasting server load or server downtime to provide users with quick access to the order directory.
- Analysis, assessment and minimization of risk: selecting the most promising clients for target e-mailing (the risk of choosing a customer who will not buy goods; the losses are calculated as the amount spent on such ineffective mailings).
- Providing recommendations: identifying products which can be sold together with high probability, creating recommendations based on preferences.
- Sequence search: analyzing customer choices while making purchases, forecasting the next possible event.
- Grouping: dividing customers or events into clusters of related elements, analyzing and predicting common features.

Modern approaches to analyzing and forecasting the preferences and behavior of clients within recommendation systems are ideologically divided into the following types [2, 5]:

1. Collaborative filtering methods.
2. Knowledge-based filtering.
3. Methods based on content analysis (content-based filtering).
4. Hybrid methods.

Collaborative filtering is the process of filtering information or samples by combining multiple technologies, points of view, data sources and more. Collaborative filtering is usually associated with very large data sets and is therefore appropriate for use in financial systems: such systems provide financial services, process large amounts of information and combine a large number of financial data sources. In the narrower sense, collaborative filtering is one of the methods for constructing forecasts in recommendation systems that uses the known ratings of a group of users to predict unknown ratings of a user [3]. The main assumption of collaborative filtering is as follows: those who have rated some objects similarly in the past tend to give similar ratings to other objects in the future.

3 Problem Statement

The following problem was solved: to develop an information technology for automating the process of providing recommendations on accompanying products and on the next products interesting for the clients.

Let there exist a matrix $R$ of size $m \times n$ whose rows correspond to subjects (clients), whose columns correspond to objects (goods), and whose entries contain feedback data (previous orders). It is necessary to find a way of transforming it into one matrix $P$ with subjects and their profiles (hidden preferences) and one matrix $Q$ with objects and their profiles (the hidden preferences that they satisfy). The matrices $P$ and $Q$ contain weights that determine how each subject or object relates to each latent factor $t$. The task is to calculate $P$ and $Q$ in such a way that their product approximates $R$ as closely as possible:

$$R \approx P \times Q.$$

In the process of iteratively assigning random values to the matrices $P$ and $Q$ and applying the least squares (LS) method, we must arrive at the values of the weights that most closely approximate the matrix $R$. In the LS algorithm, the following states of the system alternate at each iteration:

- $P$ is fixed, then $Q$ is optimized;
- $Q$ is fixed, then $P$ is optimized.

This operation is continued until the approximation $R \approx P \times Q$ is reached.
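As an illustration only (not the paper's original implementation), the alternating optimization just described can be sketched as follows. The matrix dimensions, the number of latent factors, the regularization weight and the toy order counts are assumptions chosen for the example; each update is the standard regularized least-squares solution for one factor matrix with the other held fixed.

```python
import numpy as np

def als_factorize(R, t=10, n_iters=15, lam=0.01, seed=0):
    """Approximate R (subjects x objects) by P (m x t) and Q (t x n)
    using alternating least squares, as described in Section 3."""
    m, n = R.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(m, t))
    Q = rng.normal(scale=0.1, size=(t, n))
    I = np.eye(t)
    for _ in range(n_iters):
        # P is fixed, Q is optimized (ridge regression for the object profiles)
        Q = np.linalg.solve(P.T @ P + lam * I, P.T @ R)
        # Q is fixed, P is optimized (ridge regression for the subject profiles)
        P = np.linalg.solve(Q @ Q.T + lam * I, Q @ R.T).T
    return P, Q

# Toy usage: 5 clients, 4 goods, entries are previous-order counts.
R = np.array([[1, 0, 2, 0],
              [0, 1, 0, 1],
              [3, 0, 1, 0],
              [0, 2, 0, 1],
              [1, 1, 0, 0]], dtype=float)
P, Q = als_factorize(R, t=2)
print(np.round(P @ Q, 2))  # reconstruction approximating R
```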
4 Criteria for Assessing the Quality of Customer Preferences Prediction

Standard criteria for estimating forecast quality, such as RMSE or MAE, cannot be used to assess the accuracy of the solution to the problem of analyzing and predicting customer preferences. It is difficult to tell whether there is a mistake in the forecasting model of the clients' preferences or whether it is the client's decision not to buy this product here and now; perhaps the customer will buy this product later. It would be advisable to collect statistical information about whether the recommended product was at least viewed, but this is also an indirect characteristic, since the client may simply not have had enough time although the product is interesting for him, so the recommendation is correct for this client. The following criteria for evaluating the quality of recommendations [5] were used.

Precision@k:

$$\text{Precision@}k = \frac{r_u}{k}, \qquad (1)$$

where $r_u$ is the number of recommended objects with which the subject has an interaction (that is, the number of correctly predicted preferences) and $k$ is the number of recommendations. This criterion indicates which share of the recommendations corresponds to the preferences of the subject.

Recall@k:

$$\text{Recall@}k = \frac{r_u}{n_u}, \qquad (2)$$

where $r_u$ is the number of recommended objects with which the subject had an interaction and $n_u$ is the total number of interactions performed by the subject. Recall@k evaluates which fraction of the interactions performed by clients corresponds to the predicted interactions, i.e. how many of the forecasted goods were actually interesting to the customers. It is also possible to evaluate these criteria in cash equivalents by setting the cost of each interaction and fines for the lack of interaction.

5 Implicit Feedback and Customers' Preferences Forecasting

In [6] it was proposed to introduce two concepts: the customer preference $p_{ui}$ and the level of confidence $c_{ui}$. The confidence level is calculated from the value of the raw feedback $r_{ui}$ (purchases, etc.): the more often a subject interacts with an object, the higher the confidence. The level of confidence grows through the linear scaling factor $\alpha$ (a hyperparameter of the model), and the constant 1 is always added so that a minimal confidence is assigned even to unobserved pairs [6]:

$$p_{ui} = \begin{cases} 1, & r_{ui} > 0, \\ 0, & r_{ui} = 0, \end{cases} \qquad c_{ui} = 1 + \alpha r_{ui}.$$

Then the loss function of the task is formed as

$$L = \sum_{u,i} c_{ui}\left(p_{ui} - x_u^{\top} y_i\right)^2 + \lambda\left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right), \qquad (3)$$

where $x_u$ (a row of $P$) is the subject's profile and $y_i$ (a column of $Q$) is the object's profile. The component with $\lambda$ is needed to regularize the model and prevent overfitting; the exact value of the parameter $\lambda$ depends on the data and is determined by cross-validation.

The loss function contains $m \cdot n$ terms; for typical data sets this value can reach several billion. This enormous number of terms impedes most direct optimization methods, such as stochastic gradient descent, which is widely used for explicit feedback data. Therefore, an alternative efficient optimization process was suggested in [6]: if the subjects and their profiles, or the objects and their profiles, are fixed, then the loss function becomes quadratic and its minimum can be computed analytically. This leads to the method of alternating least squares [6].

5.1 Predicting Client Preferences

After calculating the preference profiles of objects and subjects, one can recommend to a particular subject the available objects with the highest values of the predicted preference of customer $u$ for product $o$:

$$\hat{p}_{uo} = x_u^{\top} y_o, \qquad (4)$$

where $\hat{p}_{uo}$ denotes the predicted preference of subject $u$ for object $o$.
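A short sketch, under assumptions, of how the preference and confidence values of [6] are built from raw purchase counts, how the top-k recommendations of equation (4) are produced, and how they can be scored with the Precision@k and Recall@k criteria of Section 4. The names (raw_counts, alpha, user_factors, item_factors) and the value of alpha are illustrative and not taken from the paper; the factors here are random stand-ins for the output of the ALS fit.

```python
import numpy as np

def preference_confidence(raw_counts, alpha=40.0):
    """Implicit-feedback transform from [6]: binary preference p_ui and
    confidence c_ui = 1 + alpha * r_ui, built from raw interaction counts."""
    preference = (raw_counts > 0).astype(float)
    confidence = 1.0 + alpha * raw_counts
    return preference, confidence

def recommend_top_k(user_factors, item_factors, user, k=3):
    """Predicted preference (eq. 4): inner product of the user's profile with
    every object profile; return the indices of the k highest scores."""
    scores = item_factors @ user_factors[user]
    return np.argsort(scores)[::-1][:k]

def precision_recall_at_k(recommended, interacted):
    """Eq. (1) and (2): share of recommendations the subject interacted with,
    and share of the subject's interactions that were recommended."""
    hits = len(set(recommended) & set(interacted))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(interacted) if interacted else 0.0
    return precision, recall

# Toy usage with random factors for 5 clients and 4 goods.
rng = np.random.default_rng(1)
user_factors = rng.normal(size=(5, 2))
item_factors = rng.normal(size=(4, 2))
recs = recommend_top_k(user_factors, item_factors, user=0, k=2)
print(precision_recall_at_k(list(recs), interacted=[0, 2]))
```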
5.2 The Task of Finding Related Products

The search for related products can be reformulated as a search for products that are similar in terms of the preferences they satisfy for customers. The numeric expression of this similarity between objects $o$ and $o'$ is denoted as

$$s_{oo'}, \qquad (7)$$

computed from the closeness of the object profile vectors $y_o$ and $y_{o'}$ obtained above (for example, their scalar product).

6 Preliminary Data Preparation

The set of inputs is statistical information about customers who have purchased certain products. The following characteristics are collected: the unique client identifier in the system (Kunden_Id); the categorical variable for the client's gender (Geschlecht_Id, contains gaps); the name (Ort) and postal code (Plz) of the client's city (incomplete data); the client's birthdate (Geburtsdatum, contains gaps); the unique product identifier (Artikelnummer); the price (Produkt_Preis) and date of sale (Rechnungsdatum) of the product; and the quantity of the product purchased (Anzahl). That is, there are four characteristics with possibly incorrect or missing/lost data. In order to handle them properly, it is proposed to perform a deep analysis of the causes of the gaps and to use the combined method for recovering incomplete and lost data proposed by the author. The method consists of the following steps.

Step 1. Estimation of the data incompleteness of the sample for each characteristic by the share of missing values. If $I_j(\text{missing}) > 20\%$, the variable (characteristic) is excluded from the simulation, and it does not make sense to recover the missing values of this characteristic.

Step 2. Analysis of the variables and of the systematic appearance of missing values.

2.1. For a categorical variable, missing values are assigned to a separate category, i.e. the gaps are filled with the value $V_{categ}$ = "Missing".

2.2. For every numerical variable with gaps, the systematicity of their appearance is analyzed:

$$S_j = \begin{cases} 1, & \text{systematic gaps, where } I_j(\text{missing}) > 5\%, \\ 0, & \text{non-systematic gaps, where } I_j(\text{missing}) \le 5\%. \end{cases}$$

Step 3. Analysis of causes and effects. A Bayesian network is used to establish causal relationships between variables and to analyze the consequences of the occurrence of missing values; the effects serve as the target (predicted) variable of the Bayesian network.

3.1. The consequences of the occurrence of gaps are classified as

$$C_j = \begin{cases} 1, & \text{random}, \\ 2, & \text{critical}, \\ 3, & \text{catastrophic}. \end{cases}$$

3.2. If for the $j$-th variable $S_j = 0$ and $C_j = 1$, then every missing $i$-th value of the numerical variable $v_j$ is replaced by the mode of its observed values:

$$v_{ji} = \mathrm{mode}(v_{j1}, v_{j2}, \ldots).$$

3.3. Otherwise, a regression equation is used to predict the missing values.

Step 4. Regression modelling. For linear models, the representation in the form of a first-order autoregression is

$$y(k) = a_0 + a_1 y(k-1) + \varepsilon(k), \qquad E[\varepsilon(k)] = 0.$$

Then the one-step-ahead forecast is

$$y(k+1) = a_0 + a_1 y(k) + \varepsilon(k+1).$$

If the coefficients $a_0, a_1$ are known, the forecast as a conditional mathematical expectation is formed as

$$\hat{y}(k+1,k) = E_k[y(k+1)] = E_k[y(k+1) \mid y(k), y(k-1), \ldots, \varepsilon(k), \varepsilon(k-1), \ldots] = a_0 + a_1 E_k[y(k)] = a_0 + a_1 y(k).$$

For $s$ steps the forecast is calculated by the function

$$\hat{y}(k+s,k) = E_k[y(k+s)] = a_0 \sum_{i=0}^{s-1} a_1^{i} + a_1^{s} y(k).$$

The sequence of forecasts is a convergent process if the condition $|a_1| < 1$ is fulfilled, that is,

$$\lim_{s \to \infty} E_k[y(k+s)] = \frac{a_0}{1 - a_1}, \qquad |a_1| < 1.$$

The extension of the forecasting function to the autoregressive process AR($p$) is

$$\hat{y}(k+s,k) = a_0 + \sum_{i=1}^{p} a_i \hat{y}(k+s-i),$$

where $\hat{y}(k+s-i) = E_k[y(k+s-i)]$.

Step 5. Application of the recovered data in the subsequent simulation.
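A compact sketch of steps 1-4 of the combined method is given below, under assumptions: the 20% and 5% thresholds follow the method, but the column names and toy values are illustrative, the AR(1) fit is a minimal least-squares estimate, and the Bayesian-network analysis of step 3 is represented only by a placeholder flag, since its structure depends on the concrete data.

```python
import numpy as np
import pandas as pd

def recover_missing(df, drop_share=0.20, systematic_share=0.05):
    """Steps 1-4 of the combined method: drop overly incomplete columns,
    fill categorical gaps with 'Missing', flag systematic gaps, fill
    non-systematic numeric gaps with the mode, forecast the rest with AR(1)."""
    df = df.copy()
    for col in list(df.columns):
        share = df[col].isna().mean()                 # step 1: incompleteness I_j
        if share > drop_share:
            df = df.drop(columns=col)                 # exclude from the simulation
            continue
        if df[col].dtype == object:                   # step 2.1: categorical gaps
            df[col] = df[col].fillna("Missing")
            continue
        systematic = share > systematic_share         # step 2.2: flag S_j
        random_cause = True                           # step 3 placeholder (C_j = 1)
        if not systematic and random_cause:           # step 3.2: mode imputation
            df[col] = df[col].fillna(df[col].mode().iloc[0])
        else:                                         # step 4: AR(1) forecast
            y = df[col].dropna().to_numpy()
            a1, a0 = np.polyfit(y[:-1], y[1:], 1)     # fit y(k) = a0 + a1*y(k-1)
            last = y[-1]
            for idx in df.index[df[col].isna()]:
                last = a0 + a1 * last                 # one-step-ahead forecast
                df.loc[idx, col] = last
    return df

# Toy usage with hypothetical columns resembling those listed in Section 6.
frame = pd.DataFrame({
    "Geschlecht_Id": ["m", None, "f", "f", "m", "m"],
    "Anzahl": [1.0, 2.0, 1.0, 3.0, 2.0, np.nan],
})
print(recover_missing(frame))
```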
Applying the proposed method, the categorical variable Geschlecht_Id is filled on step 2 with the "Missing" value (here encoded as 0), which is assumed to mean the "gender not specified" category. For the city and postal code variables (Ort, Plz), all gaps are filled as "unknown" and all characters except digits are deleted. For the birthdate variable Geburtsdatum, direct recovery of the gaps by a regression forecast is not considered appropriate, since this characteristic is significant and the gaps can be systematic. Since "age" is usually perceived as the number of full years, it makes sense to do the following: fill the gaps with the value corresponding to the first day of the first month of the current year, and then create a new variable "Age", calculated as the difference between the year of the current date and the year stored in the Geburtsdatum variable. This variable is an integer greater than or equal to zero, and a zero marks the special case of missing data.

Fig. 1 presents the visualization of this data set (dependence charts of the characteristics).

The charts of dependencies between the characteristics show that:

- the customer database contains an almost equal number of women and men, with women in a slight minority;
- the distribution of orders among cities is close to uniform;
- the distribution of clients' ages has an average value of 61 years.

The set of attributes used for collaborative filtering is: Kunden_Id, Artikelnummer, Anzahl. It is important to note that quality testing of the recommendations is performed in two stages:

- an expert of the company enters different user IDs and subjectively analyzes the recommendations produced;
- if the first stage is successful, the next stage is performed, namely the analysis of the accuracy of the model through the Precision@k and MeanAveragePrecision@k criteria.

7 Results of the Collaborative Filtering Model Based on Alternating Least Squares (ALS)

Cross-validation, sometimes called cross-checking, is a technique for verifying how successfully the statistical model is able to work on an independent data set. Usually cross-validation is used when the purpose is prediction and it is important to assess how well the prognostic model performs in practice. Cross-validation is a way to evaluate the ability of the model to work on a hypothetical test set when it is impossible to obtain such a set explicitly [7].

The model proposed in this paper has the following hyperparameters: the number of iterations $i$; the number of hidden factors $t$; the value of the regularization factor $\lambda$. In all experiments the quality function is Precision@k.

A "grid" of parameter values is formed as: $i$ moves from 5 to 50 with a step of 5; $t = 50$; $\lambda = 0.01$. After performing 10 such runs, changing the number of iterations, the graph of Precision@k as a function of the ALS iteration count is built, as well as the graph of MeanAveragePrecision@k as a function of the iteration count. Having obtained a first approximation of the optimal number of iterations, it is fixed and the optimal value of the regularization coefficient is sought. The next "grid" of parameter values is: $i = 10$; $t = 50$; $\lambda$ moves from 0.01 to 1 with a step of 0.01. After performing 10 such runs, changing the value of the regularization factor, the plot of Precision@k against $\lambda$ is built, and likewise the dependence of MAP@k on $\lambda$ is constructed.

Having obtained the first approximations of the optimal number of iterations and of the regularization coefficient, they are fixed and the optimal number of hidden factors is sought. The next "grid" of parameter values is: $i = 10$; $t$ moves from 100 to 1600 in increments of 100; $\lambda = 0.09$. After performing 10 such runs, changing the number of hidden factors, the graph of the dependence of Precision@k on the value of $t$ is obtained (Fig. 2).

Fig. 2. The dependence of Precision@k on the number of latent factors.

Similarly, performing 10 such runs while changing the number of hidden factors, the plot of the dependence of MAP@k on the value of $t$ is constructed (Fig. 3).

Fig. 3. The dependence of MeanAveragePrecision@k on the number of latent factors.
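The staged grid search described in this section can be outlined as follows. This is an illustrative sketch, not the company's implementation: the evaluate callable and the synthetic quality surface stand in for training the ALS model and scoring it with Precision@k, while the grid boundaries are the ones quoted above; the synthetic surface is chosen to peak at the optima reported in Section 8.

```python
from typing import Callable, Dict

def staged_grid_search(evaluate: Callable[[int, int, float], float]) -> Dict[str, float]:
    """Staged tuning as in Section 7: first the number of ALS iterations i,
    then the regularization factor lambda, then the number of hidden factors t.
    `evaluate` must return Precision@k for a given (i, t, lam) configuration."""
    i_grid = range(5, 51, 5)                                  # i: 5..50, step 5 (t=50, lam=0.01)
    best_i = max(i_grid, key=lambda i: evaluate(i, 50, 0.01))

    lam_grid = [round(0.01 * j, 2) for j in range(1, 101)]    # lam: 0.01..1, step 0.01
    best_lam = max(lam_grid, key=lambda lam: evaluate(best_i, 50, lam))

    t_grid = range(100, 1601, 100)                            # t: 100..1600, step 100
    best_t = max(t_grid, key=lambda t: evaluate(best_i, t, best_lam))

    return {"iterations": best_i, "factors": best_t, "lambda": best_lam}

# Toy usage with a synthetic quality surface standing in for Precision@k
# (a real run would train the ALS model and score it on held-out interactions).
def fake_precision_at_k(i: int, t: int, lam: float) -> float:
    return -abs(i - 10) * 0.01 - abs(lam - 0.09) - abs(t - 900) * 1e-4

print(staged_grid_search(fake_precision_at_k))
```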
8 Analysis of the Results

Both accuracy functions have a declining character, which is expected: with the increase in the number of optimization iterations for each of the model components, the model overfits. For Precision@k the optimal number of iterations is 15, while for MeanAveragePrecision@k the optimal number of iterations is 10; therefore, the smaller of these values was selected. The optimal value of the regularization factor is 0.09, and the values of the obtained accuracy indicators correlate with each other.

The optimal number of latent factors is 900 (Fig. 2 and 3). The dependence function increases up to a certain level and afterwards stays almost in the same range, which indicates that an optimal number of hidden factors for the given data set was found. The number of latent factors is the most important indicator of this system, which is confirmed by the increase in prediction quality from 67% to 83%.

It should be noted that all values of the model hyperparameters are relevant only for the data set investigated in this experiment. For new samples, the process of analyzing the data should start again from the beginning according to the algorithm described above.

9 Conclusions

Ensuring customer loyalty is now a key priority in shaping the relationship between a company and its customers and in providing them with the high-quality products and services they need. Determining users' needs and making recommendations while a customer is choosing a product is the main factor in the formation of business models for many companies. The development and use of recommendation systems in the e-commerce market is currently highly relevant [9, 10]. The advice of such systems allows companies to use collaborative filtering and feature-based recommendations to serve their customers better and increase sales.

Many approaches and methods for constructing recommendation systems have been developed. Most techniques are limited by the fact that they are not able to work on such data as statistics of goods sales, etc. It is necessary to analyze the behavior of users, and for this, to determine the key factors of their behavior.
Since the entry into force of the General Data Protection Regulation (GDPR, European Union) on 25 May 2018 [8], personal data of users should not be used without the consent of the clients, and such customers should be "forgotten". It has therefore become difficult to work on advisory systems for specific clients. The way of solving this problem is to reclassify clients as 1/0 (old/new client), and extracting information from "implicit feedback on binary data", such as de-personalized statistics on the sale of goods, becomes relevant again. Commercial companies possess such statistics, so even without the influence of the GDPR the task remains relevant.

The methods of searching for hidden factors allow solving the problem of analyzing and predicting customer preferences and are therefore recommended for use in modern business solutions [2, 11].

References

1. Kuznietsova, N. V.: Information Technologies for Clients' Database Analysis and Behaviour Forecasting. In: CEUR Workshop Proceedings, Selected Papers of the XVII International Scientific and Practical Conference on Information Technologies and Security (ITS 2017), Vol. 2067, pp. 56-62 (2017). http://ceur-ws.org/Vol-2067/, last accessed 2018/11/11.
2. Recommendation Systems. Laboratory of Mathematical Logic at PDMI RAS, https://logic.pdmi.ras.ru/~sergey/slides/N16_AIRush.pdf, last accessed 2018/11/11.
3. Data Mining Concepts. Microsoft Documentation Library Homepage, https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/data-mining-concepts?view=sql-server-2017, last accessed 2018/11/11.
4. Data Mining. BaseGroup Labs Homepage, https://basegroup.ru/community/articles/data-mining, last accessed 2018/11/11.
5. Vorontsov, K. V.: Collaborative Filtering: video lectures. Yandex School of Data Analysis, https://www.youtube.com/watch?v=kfhqzkcfMqI, last accessed 2018/11/11.
6. Hu, Y., Koren, Y., Volinsky, C.: Collaborative Filtering for Implicit Feedback Datasets. In: Eighth IEEE International Conference on Data Mining (ICDM 2008), pp. 263-272 (2008). http://yifanhu.net/PUB/cf.pdf, last accessed 2018/11/11.
7. Cross Validation. LONG/SHORT Blog, http://www.long-short.pro/post/kross-validatsiya-cross-validation-304, last accessed 2018/11/11.
8. EU General Data Protection Regulation Homepage, https://eugdpr.org/, last accessed 2018/11/11.
9. Deshpande, M., Karypis, G.: Item-Based Top-N Recommendation Algorithms. ACM Transactions on Information Systems, vol. 22, pp. 143-177 (2004).
10. Takacs, G., Pilaszy, I., Nemeth, B., Tikk, D.: Major Components of the Gravity Recommendation System. SIGKDD Explorations 9, pp. 80-84 (2007).
11. Vander Plas, J.: Python for Complex Tasks: Data Science and Machine Learning. Piter, St. Petersburg, 576 p. (2018).