=Paper=
{{Paper
|id=Vol-2318/paper10
|storemode=property
|title=Analytical Technologies for Clients’ Preferences Analyzing with Incomplete Data Recovering
|pdfUrl=https://ceur-ws.org/Vol-2318/paper10.pdf
|volume=Vol-2318
|authors=Nataliia Kuznietsova
|dblpUrl=https://dblp.org/rec/conf/its2/Kuznietsova18
}}
==Analytical Technologies for Clients’ Preferences Analyzing with Incomplete Data Recovering==
Nataliia Kuznietsova

Institute for Applied System Analysis of National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, Ukraine
natalia-kpi@ukr.net

Abstract. The paper is devoted to new analytical information technologies for predicting clients' preferences. The problem of forecasting clients' preferences is now relevant for many commercial systems, companies, banks, insurance companies and e-commerce. Various marketing efforts are used to increase demand and attract new customers. The main idea is to understand customer needs and preferences and to model customer behavior with analytical technologies. Such technologies make it possible to analyze clients' data, evaluate customer demand and predict the next purchases. Modern approaches to clients' preferences prediction were analyzed, and collaborative filtering methods were chosen. The modelling task was formulated in terms of clients as subjects and purchases as objects. The method for recovering incomplete and missing data proposed by the author consists of the following stages: evaluation of sample incompleteness, analysis of whether the gaps are systematic, analysis of the gaps' causes and effects using a Bayesian network, and regression modelling for gap recovery. The implicit feedback method combined with the method for processing incomplete and missing data was built into an existing modern ERP system and makes it possible to achieve higher accuracy of clients' preferences prediction.

Keywords: Implicit feedback, Missed data, Collaborative filtering, Data recovering, Clients' preferences.

1 Introduction

The economic growth of any country and the further development of the economy are accompanied by increasing incomes and profits for companies and for the country's residents at the same time. There is also a significant development of the customer service sphere and an increasing demand for different products. Client-oriented strategies become the priority for customer service companies, and corporations engage in intensive competition for clients. The key is understanding customer needs and preferences [1, 2] and modeling consumer behavior; therefore, most companies and corporations invest heavily in developing their own solutions or purchasing existing ones [3, 4]. Such techniques allow them to analyze customer preferences and previous orders and to develop models that identify products interesting for customers, which can be offered to them for the next order.

2 Clients' Preferences Analysis

A foreign company specializing in the sale of media products (CDs, DVDs) as well as elite segment products has developed its own Enterprise Resources Planning (ERP) system for collecting statistical information about online store customers, catalogs of goods and knowledge bases built on previous experience. The knowledge is stored in the form of recommendations of related products and new acquisitions, based on the goods previously purchased in the online store. The company has more than 2 million unique customers and statistics on more than 5 million orders. Various marketing efforts, such as sending emails, are used to increase demand among its users and to attract new customers. Recommendations for related products are formed by an analyst who selects goods using his own algorithm.
Such recommendations are not very precise, since people are not able to process such volumes of information. The analyst uses the company's ERP system as an automated workplace. The analyst's recommendations are stored in the company's knowledge base, together with information on whether the client used the recommendation. The company database includes user information, product information and the date of purchase. The database contains missing and lost data.

The following modelling tasks are relevant for the company:

- Forecasting: sales estimation, forecasting server load or server downtime to provide users with quick access to the order directory.
- Analysis, assessment and minimization of risk: selecting the most promising clients for target e-mailing (the risk of choosing a customer who will not buy goods; the losses are calculated as the amount spent on such ineffective mailings).
- Providing recommendations: identifying products which can be sold together with high probability, creating recommendations based on preferences.
- Sequence search: analyzing customer choices while making purchases, forecasting the next possible event.
- Grouping: dividing customers or events into clusters of related elements, analyzing and predicting common features.

Modern approaches to analyzing and forecasting the preferences and behavior of clients within recommendation systems are ideologically divided into the following types [2, 5]:

1. Collaborative filtering methods.
2. Knowledge-based filtering.
3. Methods based on content analysis (content-based filtering).
4. Hybrid methods.

Collaborative filtering is the process of filtering information or samples by combining multiple technologies, points of view, data sources and more. Collaborative filtering is usually associated with very large data sets and is therefore appropriate for use in financial systems: such systems provide financial services, process large amounts of information and combine a large number of financial data sources. In the narrower sense, collaborative filtering is one of the methods for constructing forecasts in recommendation systems that uses the known ratings of a group of users to predict unknown ratings of a user [3]. The main assumption of collaborative filtering is as follows: those who have rated some objects similarly in the past tend to give similar ratings to other objects in the future.

3 Problem Statement

The following problem was solved: to develop an information technology for automating the process of providing recommendations on accompanying products and on the next products interesting for the clients.

Let there exist a matrix $R$ of size $m \times n$ whose rows correspond to subjects (clients), whose columns correspond to objects (goods), and whose entries contain feedback data (previous orders). It is necessary to find a way of transforming it into one matrix $P$ with subjects and their profiles (hidden preferences) and one matrix $Q$ with objects and their profiles (the hidden preferences that they satisfy). The matrices $P$ and $Q$ contain weights that determine how each subject or object relates to each latent factor $t$. The task is to calculate $P$ and $Q$ in such a way that their product approximates $R$ as closely as possible:

$$R \approx P \times Q.$$

In the process of iteratively assigning random values to the matrices $P$ and $Q$ and applying the least squares (LS) method, we must arrive at the values of the weights that most closely approximate the matrix $R$. In the LS algorithm, the following states of the system alternate at each iteration:

- $P$ is fixed, then $Q$ is optimized;
- $Q$ is fixed, then $P$ is optimized.

This operation is continued until the approximation $R \approx P \times Q$ is reached.
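As an illustration only (not the paper's original implementation), the alternating optimization just described can be sketched as follows. The matrix dimensions, the number of latent factors, the regularization weight and the toy order counts are assumptions chosen for the example; each update is the standard regularized least-squares solution for one factor matrix with the other held fixed.

```python
import numpy as np

def als_factorize(R, t=10, n_iters=15, lam=0.01, seed=0):
    """Approximate R (subjects x objects) by P (m x t) and Q (t x n)
    using alternating least squares, as described in Section 3."""
    m, n = R.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(m, t))
    Q = rng.normal(scale=0.1, size=(t, n))
    I = np.eye(t)
    for _ in range(n_iters):
        # P is fixed, Q is optimized (ridge regression for the object profiles)
        Q = np.linalg.solve(P.T @ P + lam * I, P.T @ R)
        # Q is fixed, P is optimized (ridge regression for the subject profiles)
        P = np.linalg.solve(Q @ Q.T + lam * I, Q @ R.T).T
    return P, Q

# Toy usage: 5 clients, 4 goods, entries are previous-order counts.
R = np.array([[1, 0, 2, 0],
              [0, 1, 0, 1],
              [3, 0, 1, 0],
              [0, 2, 0, 1],
              [1, 1, 0, 0]], dtype=float)
P, Q = als_factorize(R, t=2)
print(np.round(P @ Q, 2))  # reconstruction approximating R
```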
4 Criteria for Assessing the Quality of Customer Preferences Prediction

Standard criteria for estimating forecast quality, such as RMSE or MAE, cannot be used to assess the accuracy of the solution to the problem of analyzing and predicting customer preferences. It is difficult to tell whether there is a mistake in the forecasting model of the clients' preferences or whether it is the client's decision not to buy this product here and now; perhaps the customer will buy this product later. It would be advisable to collect statistical information about whether the recommended product was at least viewed, but this is also an indirect characteristic, since the client may simply not have had enough time although the product is interesting for him, so the recommendation is correct for this client. The following criteria for evaluating the quality of recommendations [5] were used.

Precision@k:

$$\text{Precision@}k = \frac{r_u}{k}, \qquad (1)$$

where $r_u$ is the number of recommended objects with which the subject has an interaction (that is, the number of correctly predicted preferences) and $k$ is the number of recommendations. This criterion indicates which share of the recommendations corresponds to the preferences of the subject.

Recall@k:

$$\text{Recall@}k = \frac{r_u}{n_u}, \qquad (2)$$

where $r_u$ is the number of recommended objects with which the subject had an interaction and $n_u$ is the total number of interactions performed by the subject. Recall@k evaluates which fraction of the interactions performed by clients corresponds to the predicted interactions, i.e. how many of the forecasted goods were actually interesting to the customers. It is also possible to evaluate these criteria in cash equivalents by setting the cost of each interaction and fines for the lack of interaction.

5 Implicit Feedback and Customers' Preferences Forecasting

In [6] it was proposed to introduce two concepts: the customer preference $p_{ui}$ and the level of confidence $c_{ui}$. The confidence level is calculated from the value of the raw feedback $r_{ui}$ (purchases, etc.): the more often a subject interacts with an object, the higher the confidence. The level of confidence grows through the linear scaling factor $\alpha$ (a hyperparameter of the model), and the constant 1 is always added so that a minimal confidence is assigned even to unobserved pairs [6]:

$$p_{ui} = \begin{cases} 1, & r_{ui} > 0, \\ 0, & r_{ui} = 0, \end{cases} \qquad c_{ui} = 1 + \alpha r_{ui}.$$

Then the loss function of the task is formed as

$$L = \sum_{u,i} c_{ui}\left(p_{ui} - x_u^{\top} y_i\right)^2 + \lambda\left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right), \qquad (3)$$

where $x_u$ (a row of $P$) is the subject's profile and $y_i$ (a column of $Q$) is the object's profile. The component with $\lambda$ is needed to regularize the model and prevent overfitting; the exact value of the parameter $\lambda$ depends on the data and is determined by cross-validation.

The loss function contains $m \cdot n$ terms; for typical data sets this value can reach several billion. This enormous number of terms impedes most direct optimization methods, such as stochastic gradient descent, which is widely used for explicit feedback data. Therefore, an alternative efficient optimization process was suggested in [6]: if the subjects and their profiles, or the objects and their profiles, are fixed, then the loss function becomes quadratic and its minimum can be computed analytically. This leads to the method of alternating least squares [6].

5.1 Predicting Client Preferences

After calculating the preference profiles of objects and subjects, one can recommend to a particular subject the available objects with the highest values of the predicted preference of customer $u$ for product $o$:

$$\hat{p}_{uo} = x_u^{\top} y_o, \qquad (4)$$

where $\hat{p}_{uo}$ denotes the predicted preference of subject $u$ for object $o$.
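A short sketch, under assumptions, of how the preference and confidence values of [6] are built from raw purchase counts, how the top-k recommendations of equation (4) are produced, and how they can be scored with the Precision@k and Recall@k criteria of Section 4. The names (raw_counts, alpha, user_factors, item_factors) and the value of alpha are illustrative and not taken from the paper; the factors here are random stand-ins for the output of the ALS fit.

```python
import numpy as np

def preference_confidence(raw_counts, alpha=40.0):
    """Implicit-feedback transform from [6]: binary preference p_ui and
    confidence c_ui = 1 + alpha * r_ui, built from raw interaction counts."""
    preference = (raw_counts > 0).astype(float)
    confidence = 1.0 + alpha * raw_counts
    return preference, confidence

def recommend_top_k(user_factors, item_factors, user, k=3):
    """Predicted preference (eq. 4): inner product of the user's profile with
    every object profile; return the indices of the k highest scores."""
    scores = item_factors @ user_factors[user]
    return np.argsort(scores)[::-1][:k]

def precision_recall_at_k(recommended, interacted):
    """Eq. (1) and (2): share of recommendations the subject interacted with,
    and share of the subject's interactions that were recommended."""
    hits = len(set(recommended) & set(interacted))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(interacted) if interacted else 0.0
    return precision, recall

# Toy usage with random factors for 5 clients and 4 goods.
rng = np.random.default_rng(1)
user_factors = rng.normal(size=(5, 2))
item_factors = rng.normal(size=(4, 2))
recs = recommend_top_k(user_factors, item_factors, user=0, k=2)
print(precision_recall_at_k(list(recs), interacted=[0, 2]))
```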
5.2 The Task of Finding Related Products

The search for related products can be reformulated as a search for products that are similar in terms of the preferences they satisfy for customers. The numeric expression of this similarity between objects $o$ and $o'$ is denoted as

$$s_{oo'}, \qquad (7)$$

computed from the closeness of the object profile vectors $y_o$ and $y_{o'}$ obtained above (for example, their scalar product).

6 Preliminary Data Preparation

The set of inputs is statistical information about customers who have purchased certain products. The following characteristics are collected: the unique client identifier in the system (Kunden_Id); the categorical variable for the client's gender (Geschlecht_Id, contains gaps); the name (Ort) and postal code (Plz) of the client's city (incomplete data); the client's birthdate (Geburtsdatum, contains gaps); the unique product identifier (Artikelnummer); the price (Produkt_Preis) and date of sale (Rechnungsdatum) of the product; and the quantity of the product purchased (Anzahl). That is, there are four characteristics with possibly incorrect or missing/lost data. In order to handle them properly, it is proposed to perform a deep analysis of the causes of the gaps and to use the combined method for recovering incomplete and lost data proposed by the author. The method consists of the following steps.

Step 1. Estimation of the data incompleteness of the sample for each characteristic by the share of missing values. If $I_j(\text{missing}) > 20\%$, the variable (characteristic) is excluded from the simulation, and it does not make sense to recover the missing values of this characteristic.

Step 2. Analysis of the variables and of the systematic appearance of missing values.

2.1. For a categorical variable, missing values are assigned to a separate category, i.e. the gaps are filled with the value $V_{categ}$ = "Missing".

2.2. For every numerical variable with gaps, the systematicity of their appearance is analyzed:

$$S_j = \begin{cases} 1, & \text{systematic gaps, where } I_j(\text{missing}) > 5\%, \\ 0, & \text{non-systematic gaps, where } I_j(\text{missing}) \le 5\%. \end{cases}$$

Step 3. Analysis of causes and effects. A Bayesian network is used to establish causal relationships between variables and to analyze the consequences of the occurrence of missing values; the effects serve as the target (predicted) variable of the Bayesian network.

3.1. The consequences of the occurrence of gaps are classified as

$$C_j = \begin{cases} 1, & \text{random}, \\ 2, & \text{critical}, \\ 3, & \text{catastrophic}. \end{cases}$$

3.2. If for the $j$-th variable $S_j = 0$ and $C_j = 1$, then every missing $i$-th value of the numerical variable $v_j$ is replaced by the mode of its observed values:

$$v_{ji} = \mathrm{mode}(v_{j1}, v_{j2}, \ldots).$$

3.3. Otherwise, a regression equation is used to predict the missing values.

Step 4. Regression modelling. For linear models, the representation in the form of a first-order autoregression is

$$y(k) = a_0 + a_1 y(k-1) + \varepsilon(k), \qquad E[\varepsilon(k)] = 0.$$

Then the one-step-ahead forecast is

$$y(k+1) = a_0 + a_1 y(k) + \varepsilon(k+1).$$

If the coefficients $a_0, a_1$ are known, the forecast as a conditional mathematical expectation is formed as

$$\hat{y}(k+1,k) = E_k[y(k+1)] = E_k[y(k+1) \mid y(k), y(k-1), \ldots, \varepsilon(k), \varepsilon(k-1), \ldots] = a_0 + a_1 E_k[y(k)] = a_0 + a_1 y(k).$$

For $s$ steps the forecast is calculated by the function

$$\hat{y}(k+s,k) = E_k[y(k+s)] = a_0 \sum_{i=0}^{s-1} a_1^{i} + a_1^{s} y(k).$$

The sequence of forecasts is a convergent process if the condition $|a_1| < 1$ is fulfilled, that is,

$$\lim_{s \to \infty} E_k[y(k+s)] = \frac{a_0}{1 - a_1}, \qquad |a_1| < 1.$$

The extension of the forecasting function to the autoregressive process AR($p$) is

$$\hat{y}(k+s,k) = a_0 + \sum_{i=1}^{p} a_i \hat{y}(k+s-i),$$

where $\hat{y}(k+s-i) = E_k[y(k+s-i)]$.

Step 5. Application of the recovered data in the subsequent simulation.
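A compact sketch of steps 1-4 of the combined method is given below, under assumptions: the 20% and 5% thresholds follow the method, but the column names and toy values are illustrative, the AR(1) fit is a minimal least-squares estimate, and the Bayesian-network analysis of step 3 is represented only by a placeholder flag, since its structure depends on the concrete data.

```python
import numpy as np
import pandas as pd

def recover_missing(df, drop_share=0.20, systematic_share=0.05):
    """Steps 1-4 of the combined method: drop overly incomplete columns,
    fill categorical gaps with 'Missing', flag systematic gaps, fill
    non-systematic numeric gaps with the mode, forecast the rest with AR(1)."""
    df = df.copy()
    for col in list(df.columns):
        share = df[col].isna().mean()                 # step 1: incompleteness I_j
        if share > drop_share:
            df = df.drop(columns=col)                 # exclude from the simulation
            continue
        if df[col].dtype == object:                   # step 2.1: categorical gaps
            df[col] = df[col].fillna("Missing")
            continue
        systematic = share > systematic_share         # step 2.2: flag S_j
        random_cause = True                           # step 3 placeholder (C_j = 1)
        if not systematic and random_cause:           # step 3.2: mode imputation
            df[col] = df[col].fillna(df[col].mode().iloc[0])
        else:                                         # step 4: AR(1) forecast
            y = df[col].dropna().to_numpy()
            a1, a0 = np.polyfit(y[:-1], y[1:], 1)     # fit y(k) = a0 + a1*y(k-1)
            last = y[-1]
            for idx in df.index[df[col].isna()]:
                last = a0 + a1 * last                 # one-step-ahead forecast
                df.loc[idx, col] = last
    return df

# Toy usage with hypothetical columns resembling those listed in Section 6.
frame = pd.DataFrame({
    "Geschlecht_Id": ["m", None, "f", "f", "m", "m"],
    "Anzahl": [1.0, 2.0, 1.0, 3.0, 2.0, np.nan],
})
print(recover_missing(frame))
```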
Applying the proposed method, the categorical variable Geschlecht_Id is filled on step 2 with the "Missing" value (here encoded as 0), which is assumed to mean the "gender not specified" category. For the city and postal code variables (Ort, Plz), all gaps are filled as "unknown" and all characters except digits are deleted. For the birthdate variable Geburtsdatum, direct recovery of the gaps by a regression forecast is not considered appropriate, since this characteristic is significant and the gaps can be systematic. Since "age" is usually perceived as the number of full years, it makes sense to do the following: fill the gaps with the value corresponding to the first day of the first month of the current year, and then create a new variable "Age", calculated as the difference between the year of the current date and the year stored in the Geburtsdatum variable. This variable is an integer greater than or equal to zero, and a zero marks the special case of missing data.

Fig. 1 presents the visualization of this data set (dependence charts of the characteristics).

The charts of dependencies between the characteristics show that:

- the customer database contains an almost equal number of women and men, with women in a slight minority;
- the distribution of orders among cities is close to uniform;
- the distribution of clients' ages has an average value of 61 years.

The set of attributes used for collaborative filtering is: Kunden_Id, Artikelnummer, Anzahl. It is important to note that quality testing of the recommendations is performed in two stages:

- an expert of the company enters different user IDs and subjectively analyzes the recommendations produced;
- if the first stage is successful, the next stage is performed, namely the analysis of the accuracy of the model through the Precision@k and MeanAveragePrecision@k criteria.

7 Results of the Collaborative Filtering Model Based on Alternating Least Squares (ALS)

Cross-validation, sometimes called cross-checking, is a technique for verifying how successfully the statistical model is able to work on an independent data set. Usually cross-validation is used when the purpose is prediction and it is important to assess how well the prognostic model performs in practice. Cross-validation is a way to evaluate the ability of the model to work on a hypothetical test set when it is impossible to obtain such a set explicitly [7].

The model proposed in this paper has the following hyperparameters: the number of iterations $i$; the number of hidden factors $t$; the value of the regularization factor $\lambda$. In all experiments the quality function is Precision@k.

A "grid" of parameter values is formed as: $i$ moves from 5 to 50 with a step of 5; $t = 50$; $\lambda = 0.01$. After performing 10 such runs, changing the number of iterations, the graph of Precision@k as a function of the ALS iteration count is built, as well as the graph of MeanAveragePrecision@k as a function of the iteration count. Having obtained a first approximation of the optimal number of iterations, it is fixed and the optimal value of the regularization coefficient is sought. The next "grid" of parameter values is: $i = 10$; $t = 50$; $\lambda$ moves from 0.01 to 1 with a step of 0.01. After performing 10 such runs, changing the value of the regularization factor, the plot of Precision@k against $\lambda$ is built, and likewise the dependence of MAP@k on $\lambda$ is constructed.

Having obtained the first approximations of the optimal number of iterations and of the regularization coefficient, they are fixed and the optimal number of hidden factors is sought. The next "grid" of parameter values is: $i = 10$; $t$ moves from 100 to 1600 in increments of 100; $\lambda = 0.09$. After performing 10 such runs, changing the number of hidden factors, the graph of the dependence of Precision@k on the value of $t$ is obtained (Fig. 2).

Fig. 2. The dependence of Precision@k on the number of latent factors.

Similarly, performing 10 such runs while changing the number of hidden factors, the plot of the dependence of MAP@k on the value of $t$ is constructed (Fig. 3).

Fig. 3. The dependence of MeanAveragePrecision@k on the number of latent factors.
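The staged grid search described in this section can be outlined as follows. This is an illustrative sketch, not the company's implementation: the evaluate callable and the synthetic quality surface stand in for training the ALS model and scoring it with Precision@k, while the grid boundaries are the ones quoted above; the synthetic surface is chosen to peak at the optima reported in Section 8.

```python
from typing import Callable, Dict

def staged_grid_search(evaluate: Callable[[int, int, float], float]) -> Dict[str, float]:
    """Staged tuning as in Section 7: first the number of ALS iterations i,
    then the regularization factor lambda, then the number of hidden factors t.
    `evaluate` must return Precision@k for a given (i, t, lam) configuration."""
    i_grid = range(5, 51, 5)                                  # i: 5..50, step 5 (t=50, lam=0.01)
    best_i = max(i_grid, key=lambda i: evaluate(i, 50, 0.01))

    lam_grid = [round(0.01 * j, 2) for j in range(1, 101)]    # lam: 0.01..1, step 0.01
    best_lam = max(lam_grid, key=lambda lam: evaluate(best_i, 50, lam))

    t_grid = range(100, 1601, 100)                            # t: 100..1600, step 100
    best_t = max(t_grid, key=lambda t: evaluate(best_i, t, best_lam))

    return {"iterations": best_i, "factors": best_t, "lambda": best_lam}

# Toy usage with a synthetic quality surface standing in for Precision@k
# (a real run would train the ALS model and score it on held-out interactions).
def fake_precision_at_k(i: int, t: int, lam: float) -> float:
    return -abs(i - 10) * 0.01 - abs(lam - 0.09) - abs(t - 900) * 1e-4

print(staged_grid_search(fake_precision_at_k))
```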
8 Analysis of the Results

Both accuracy functions have a declining character, which is expected: with the increase in the number of optimization iterations for each of the model components, the model overfits. For Precision@k the optimal number of iterations is 15, while for MeanAveragePrecision@k the optimal number of iterations is 10; therefore, the smaller of these values was selected. The optimal value of the regularization factor is 0.09, and the values of the obtained accuracy indicators correlate with each other.

The optimal number of latent factors is 900 (Fig. 2 and 3). The dependence function increases up to a certain level and afterwards stays almost in the same range, which indicates that an optimal number of hidden factors for the given data set was found. The number of latent factors is the most important indicator of this system, which is confirmed by the increase in prediction quality from 67% to 83%.

It should be noted that all values of the model hyperparameters are relevant only for the data set investigated in this experiment. For new samples, the process of analyzing the data should start again from the beginning according to the algorithm described above.

9 Conclusions

Ensuring customer loyalty is now a key priority in shaping the relationship between a company and its customers and in providing them with the high-quality products and services they need. Determining users' needs and making recommendations while a customer is choosing a product is the main factor in the formation of business models for many companies. The development and use of recommendation systems in the e-commerce market is currently highly relevant [9, 10]. The advice of such systems allows companies to use collaborative filtering and feature-based recommendations to serve their customers better and increase sales.

Many approaches and methods for constructing recommendation systems have been developed. Most techniques are limited by the fact that they are not able to work on such data as statistics of goods sales, etc. It is necessary to analyze the behavior of users, and for this, to determine the key factors of their behavior.
Since the entry into force of the General Data Protection Regulation (GDPR, European Union) on 25 May 2018 [8], personal data of users should not be used without the consent of the clients, and such customers should be "forgotten". It has therefore become difficult to work on advisory systems for specific clients. The way of solving this problem is to reclassify clients as 1/0 (old/new client), and extracting information from "implicit feedback on binary data", such as de-personalized statistics on the sale of goods, becomes relevant again. Commercial companies possess such statistics, so even without the influence of the GDPR the task remains relevant.

The methods of searching for hidden factors allow solving the problem of analyzing and predicting customer preferences and are therefore recommended for use in modern business solutions [2, 11].

References

1. Kuznietsova, N. V.: Information Technologies for Clients' Database Analysis and Behaviour Forecasting. In: CEUR Workshop Proceedings, Selected Papers of the XVII International Scientific and Practical Conference on Information Technologies and Security (ITS 2017), Vol. 2067, pp. 56-62 (2017). http://ceur-ws.org/Vol-2067/, last accessed 2018/11/11.
2. Recommendation Systems. Laboratory of Mathematical Logic at PDMI RAS, https://logic.pdmi.ras.ru/~sergey/slides/N16_AIRush.pdf, last accessed 2018/11/11.
3. Data Mining Concepts. Microsoft Documentation Library Homepage, https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/data-mining-concepts?view=sql-server-2017, last accessed 2018/11/11.
4. Data Mining. BaseGroup Labs Homepage, https://basegroup.ru/community/articles/data-mining, last accessed 2018/11/11.
5. Vorontsov, K. V.: Collaborative Filtering: video lectures. Yandex School of Data Analysis, https://www.youtube.com/watch?v=kfhqzkcfMqI, last accessed 2018/11/11.
6. Hu, Y., Koren, Y., Volinsky, C.: Collaborative Filtering for Implicit Feedback Datasets. In: Eighth IEEE International Conference on Data Mining (ICDM 2008), pp. 263-272 (2008). http://yifanhu.net/PUB/cf.pdf, last accessed 2018/11/11.
7. Cross Validation. LONG/SHORT Blog, http://www.long-short.pro/post/kross-validatsiya-cross-validation-304, last accessed 2018/11/11.
8. EU General Data Protection Regulation Homepage, https://eugdpr.org/, last accessed 2018/11/11.
9. Deshpande, M., Karypis, G.: Item-Based Top-N Recommendation Algorithms. ACM Transactions on Information Systems, vol. 22, pp. 143-177 (2004).
10. Takacs, G., Pilaszy, I., Nemeth, B., Tikk, D.: Major Components of the Gravity Recommendation System. SIGKDD Explorations 9, pp. 80-84 (2007).
11. Vander Plas, J.: Python for Complex Tasks: Data Science and Machine Learning. Piter, St. Petersburg, 576 p. (2018).