Comparison of Underlying Algorithms in Recommendation Systems

Shuyi Wang1, Mufeng Yang2*, Ziyang Liu3, and Linyan Li4
1 Department of Mathematics, University of North Carolina at Chapel Hill, Chapel Hill, USA
2 Department of Computer Science, Fu Jen Catholic University, Taiwan, China
3 Department of Mathematics, University of California, Santa Barbara, Santa Barbara, USA
4 Department of Mathematics, Henan University of Economics and Law, Henan, China
* Corresponding author: yangmufeng233@gmail.com

Abstract
Recommendation systems form one of the most popular fields in machine learning and have received significant attention due to their applicability to everyday life. This paper reviews the development history of recommendation systems, from CF to the more recent LS-PLM. We first introduce a representative algorithm from each period and shed light on how these algorithms evolved with the varying demands of mobile applications and with scientists' growing understanding of machine learning in general. To illustrate each algorithm concretely, we present its basic framework, underlying mechanics, application methods, and potential weaknesses for comparison. We also run numerical experiments on four of the introduced algorithms, FM, FFM, GBDT+LR, and LS-PLM, to test their performance on a particular dataset, KKBox's Music Recommendation Challenge. Finally, we conclude that FFM attains the highest accuracy and attribute its advantage to better handling of sparse features. This work presents meaningful viewpoints on the comparison of recommendation algorithms through descriptions and experiments, and may further facilitate the emergence of new state-of-the-art algorithms.

Keywords
Factorization Machine, Field-aware Factorization Machine, Gradient Boosting Decision Tree + Logistic Regression, Large Scale Piece-wise Linear Model

1. Introduction
Since the concept of online shopping first appeared in the late 20th century, it has played an essential role in daily business.
For example, Amazon has over 300 million active customer accounts worldwide, with 1.6 million packages shipped and about 385 million dollars in sales reached per day. As more and more products are sold online, one problem comes up: how to recommend those products to the potential customers who need them? Amazon's recommendation system is estimated to drive about 35% of its overall sales. In fact, not only goods-selling companies like Amazon but also companies like Instagram, which recommends people from social media to you, use recommendation systems in their operations.

One of the fundamentals of a recommendation system is embedding. An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Objects are distributed in this space according to their features; one common feature can correspond to one dimension. For instance, one dimension might encode the size of objects, while another encodes which age group uses them most. Some companies use embeddings to measure the similarities between objects and express those similarities as the distances between them in the space. For example, both a burger and a sandwich are food, so the distance between burger and sandwich should be smaller than the distance between burger and fossil. Distances in an embedding should also carry meaning: the distance between "woman" and "man" should approximately equal the distance between "female" and "male". After obtaining the relations, or distances, between different objects, the recommendation system can recommend similar goods to customers based on their preferences or purchase history.

2. Pre-Deep-Learning Era: Recommender Systems
The following is the evolution diagram of the recommender system models.

Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Figure 1: Algorithms of recommender systems

Figure 1 shows that the traditional recommendation models consist of the following four components: CF, LR, FFM, and the combined model (GBDT+LR).

2.1. Collaborative Filtering
CF filters a large amount of information collaboratively through feedback, reviews, and opinions, and selects the information that may interest the target user. On the development of CF: Tapestry was the first system to apply CF, solving the information overload problem of a research center's mail system that had no way to filter and classify messages [1]. GroupLens was a milestone: the system was applied to news filtering, helping readers find the news content they were interested in through records of readers' ratings of the content. Similar web recommender platforms are still prevalent today, such as YouTube, and CF has also achieved outstanding success in the field of e-commerce. CF is mainly divided into user-based, item-based, memory-based, and model-based approaches.

User-based: First, collect information that represents user interests, usually user ratings or evaluations, which are called "active ratings"; evaluations inferred by the system from user behavior are called "passive ratings". The starting point of UserCF is to compute the similarity between two users. For example, to decide whether user A is interested in commodity X, we use the historical evaluation data (user A's historical evaluations of other commodities, and other users' historical evaluations) to find the n users most similar to user A, and then synthesize those similar users' evaluations of the commodity; the weighted average of user similarities and similar users' ratings is usually used to obtain the predicted evaluation for the target user. With this formula, the final recommendation list can be sorted according to the predicted scores.
$$R_{u,p} = \frac{\sum_{s\in S} w_{u,s}\, R_{s,p}}{\sum_{s\in S} w_{u,s}} \qquad (1)$$

The weight $w_{u,s}$ represents the similarity between user u and user s, and $R_{s,p}$ represents user s's rating of item p. If the predicted rating is negative, the recommender system will not recommend X to A; otherwise, commodity X is recommended to user A.

Item-based: Similar to the user-based approach, the historical data should be collected first, and then the similarity between the already-evaluated commodities and the commodity to be predicted is calculated. Find the k commodities most similar to the commodity to be predicted, use the similarities as weights to combine the scores of the evaluated commodities, and finally obtain the predicted value for the target commodity.

How is similarity calculated?
⚫ Cosine similarity measures the angle between user vector i and user vector j:

$$\mathrm{sim}(i,j) = \cos(i,j) = \frac{i\cdot j}{\|i\|\,\|j\|} \qquad (2)$$

⚫ Pearson's correlation coefficient corrects for each user's average score:

$$\mathrm{sim}(i,j) = \frac{\sum_{p\in P}(R_{i,p}-\bar R_i)(R_{j,p}-\bar R_j)}{\sqrt{\sum_{p\in P}(R_{i,p}-\bar R_i)^2}\,\sqrt{\sum_{p\in P}(R_{j,p}-\bar R_j)^2}} \qquad (3)$$

$R_{i,p}$ represents user i's rating of item p, $\bar R_i$ represents user i's average rating over all items, and P represents the set of all items.

⚫ Based on the idea of the Pearson coefficient, the item's average score can be introduced instead:

$$\mathrm{sim}(i,j) = \frac{\sum_{p\in P}(R_{i,p}-\bar R_p)(R_{j,p}-\bar R_p)}{\sqrt{\sum_{p\in P}(R_{i,p}-\bar R_p)^2}\,\sqrt{\sum_{p\in P}(R_{j,p}-\bar R_p)^2}} \qquad (4)$$

$\bar R_p$ represents the average of all ratings of item p.

In traditional collaborative filtering systems, the amount of work increases with the number of participants in the system [2]. One disadvantage of UserCF is that the number of users is often much greater than the number of items, so the storage overhead of the user-similarity matrix is very large: its space complexity grows as $O(n^2)$ in the number of users, an unbearable rate of expansion.
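As a concrete illustration, the weighted-average prediction of equation (1) and the cosine similarity of equation (2) can be sketched in a few lines. This is a minimal sketch; the function names and the toy neighbour ratings are hypothetical, not from the paper's experiments:

```python
import numpy as np

def cosine_sim(i, j):
    """Equation (2): cosine of the angle between two rating vectors."""
    i, j = np.asarray(i, float), np.asarray(j, float)
    return float(i @ j / (np.linalg.norm(i) * np.linalg.norm(j)))

def predict_rating(sims, ratings):
    """Equation (1): similarity-weighted average of neighbours' ratings.

    sims    -- w_{u,s} for each similar user s
    ratings -- R_{s,p}, each similar user's rating of item p
    """
    sims, ratings = np.asarray(sims, float), np.asarray(ratings, float)
    return float((sims * ratings).sum() / sims.sum())

# Hypothetical example: two neighbours rated item p as 4 and 2,
# with similarities 0.9 and 0.1 to the target user.
print(predict_rating([0.9, 0.1], [4.0, 2.0]))  # -> 3.8, weighted toward 4
```

The prediction leans toward the rating of the more similar neighbour, which is exactly the behaviour equation (1) encodes.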
Moreover, because the data vectors are usually very sparse, UserCF is not suitable for applications where positive feedback is difficult to obtain. Comparing the two types of collaborative filtering: UserCF has stronger social characteristics, and it can surface points of interest outside a user's previous range, or quickly update the recommendation list through the activity of "friends". It is therefore applicable to hotspot discovery, hotspot trend tracking, and news recommendation scenarios [2]. ItemCF is suitable for applications where interests change slowly, so ItemCF is better than UserCF at recommending videos of similar styles and types.

In general, CF is a very intuitive and explainable model. It can filter information that is difficult for a machine to analyze automatically, share the experience of other users, and recommend new information. The recommendation is personalized and highly automated. The main problems are as follows: the recommendation quality is poor at the start of the system, since the quality depends on historical data; the head effect of the recommendation results is obvious; the ability to process sparse vectors is weak; and the system is not very extensible and lacks strong generalization ability. To solve these problems and increase the generalization ability of the model, the matrix factorization technique was proposed.

2.2. Matrix Factorization
On the basis of the co-occurrence matrix of collaborative filtering, the concept of the latent vector is added to strengthen the model's ability to deal with sparse matrices and to address the main problems of collaborative filtering. As a tool of matrix mathematics, matrix factorization is good at finding information hidden in data. It is mainly used for rating prediction.
In general, we have many users and commodities and express them in the form of a two-dimensional matrix, in which each element represents the corresponding user's rating of the corresponding commodity. Obviously, such a matrix is sparse in real life: public tastes differ, and each user will not have encountered and scored every commodity. Rating prediction can therefore be regarded as completing the missing elements of the matrix, that is, inferring the missing data from the existing data in the matrix [3]. Matrix factorization aims to complete the matrix better. In essence, it decomposes a matrix into two or more small matrices whose product reconstructs the original matrix. In application, matrix factorization can find the implicit characteristics of the interactions between different users. Such information is obtained by analyzing users' behaviors instead of being given directly, which is its advantage. Thus, by discovering these underlying characteristics, we can predict the rating of a given user for a given item when the relevant characteristics match, and therefore predict whether users will like something they have never touched before. When using this method, we assume that the number of latent features is smaller than the number of users and items: if each user were independently associated with a distinct feature, so that people and items had nothing in common, recommendation would be meaningless. Suppose we have a set U of users and a set D of items, and create a rating matrix M of dimension (m, n). This matrix can be viewed as the product of two factor matrices of dimensions (m, k) and (k, n), so that each rating is predicted by a dot product:

$$\hat R_{ui} = q_i^T p_u \qquad (5)$$

We classify matrix factorization into Eigen Decomposition, Singular Value Decomposition (SVD), and FunkSVD (gradient descent).
Since Eigen Decomposition can only be applied to square matrices, it is generally not applicable to decomposing a user-item matrix. Singular Value Decomposition factorizes the matrix into:

$$M \approx U_{m\times k}\, \Sigma_{k\times k}\, V^T_{k\times n} \qquad (6)$$

$\Sigma$ is a diagonal matrix; taking its largest k elements as latent features and deleting the other dimensions in $\Sigma$, as well as the corresponding dimensions in U and V, completes the matrix decomposition with k-dimensional latent vectors. However, applying SVD requires the original matrix to be dense, whereas the co-occurrence matrix in Internet settings is generally very sparse; using this method would require filling in the missing data, so the method is limited. In addition, the computational complexity of the traditional decomposition is as high as $O(mn^2)$, which is unacceptable. To solve these problems, Simon Funk proposed the improved FunkSVD method [4], which decomposes the original matrix directly into two low-rank matrices (instead of three) while reducing the complexity:

$$\min_{q^*,p^*} \sum_{(u,i)\in K} \big(r_{ui} - q_i^T p_u\big)^2 \qquad (7)$$

In this way, the difference between the original score r and the product of the user and item vectors is minimized, and the original data of the co-occurrence matrix are retained to the maximum extent, where K is the set of all rated (user, item) samples. To avoid over-fitting, an L2 regularization term is added to the FunkSVD objective function:

$$\min_{q^*,p^*} \sum_{(u,i)\in K} \big(r_{ui} - q_i^T p_u\big)^2 + \lambda\big(\|q_i\|^2 + \|p_u\|^2\big) \qquad (8)$$

$\lambda$ is called the regularization coefficient; the larger $\lambda$ is, the stronger the regularization constraint. Regularization means putting constraints on the model so that it is more regular and stable, to avoid producing unpredictable results. The objective function is generally solved by the gradient descent method.
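Before deriving the update rules, it may help to see the whole procedure in miniature. The following numpy sketch minimizes the regularized objective (8) by stochastic gradient descent over the observed entries only; the hyperparameters and the 3x3 toy rating matrix are arbitrary assumptions, and the bias terms of the extended model are omitted for brevity:

```python
import numpy as np

def funk_svd(R, k=2, gamma=0.02, lam=0.1, epochs=500, seed=0):
    """Minimal FunkSVD sketch: SGD on the observed entries of R.

    R is a dense array with np.nan marking missing ratings.
    Returns user latent vectors P (m x k) and item latent vectors Q (n x k).
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = rng.normal(scale=0.1, size=(m, k))   # user vectors p_u
    Q = rng.normal(scale=0.1, size=(n, k))   # item vectors q_i
    observed = [(u, i) for u in range(m) for i in range(n)
                if not np.isnan(R[u, i])]
    for _ in range(epochs):
        for u, i in observed:
            err = R[u, i] - Q[i] @ P[u]          # r_ui - q_i^T p_u
            q_old = Q[i].copy()                   # update from old values
            Q[i] += gamma * (err * P[u] - lam * Q[i])
            P[u] += gamma * (err * q_old - lam * P[u])
    return P, Q

# Hypothetical 3x3 rating matrix with missing entries to complete.
R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 1.0, 5.0]])
P, Q = funk_svd(R)
print(Q[2] @ P[0])  # predicted rating for the missing (user 0, item 2) cell
```

The loop applies exactly one stochastic step per observed rating, stepping against the gradient of the squared error plus the L2 penalty.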
Taking partial derivatives of the objective function with respect to $q_i$ and $p_u$ respectively gives:

$$2\big(r_{ui} - q_i^T p_u\big)p_u - 2\lambda q_i \qquad (9)$$

$$2\big(r_{ui} - q_i^T p_u\big)q_i - 2\lambda p_u \qquad (10)$$

According to these results, the parameters are updated in the direction opposite to the gradient:

$$q_i \leftarrow q_i + \gamma\big((r_{ui} - q_i^T p_u)p_u - \lambda q_i\big) \qquad (11)$$

$$p_u \leftarrow p_u + \gamma\big((r_{ui} - q_i^T p_u)q_i - \lambda p_u\big) \qquad (12)$$

$\gamma$ is the learning rate. Meanwhile, to eliminate user and item scoring biases, deviation terms for users and items are often added in matrix factorization:

$$r_{ui} = \mu + b_i + b_u + q_i^T p_u \qquad (13)$$

$\mu$ is the global bias constant, $b_i$ the item bias, and $b_u$ the user bias. The resulting dot product, $q_i^T p_u$, captures the interaction between user u and item i: the user's overall interest in the item's characteristics [3]. The objective function of matrix factorization then becomes:

$$\min_{q^*,p^*,b^*} \sum_{(u,i)\in K} \big(r_{ui} - \mu - b_i - b_u - p_u^T q_i\big)^2 + \lambda\big(\|q_i\|^2 + \|p_u\|^2 + b_u^2 + b_i^2\big) \qquad (14)$$

FunkSVD took matrix factorization to a new height and is widely used. Matrix factorization is simple to operate, easy to program, and has low space complexity, good generalization ability, scalability, and flexibility; it is a good choice for small recommendation systems, though it has no obvious advantage in large ones. At the same time, because it is inconvenient to add user, item, and other side features, matrix factorization loses the opportunity to use a lot of effective information and cannot make effective recommendations in the absence of user behavior.

3. Logistic Regression
To make up for this defect of matrix factorization, the logistic regression model was developed, which can integrate a variety of different features and generate relatively comprehensive recommendation results. Logistic regression is in fact a classification model, mainly used to solve binary classification problems and to estimate the probability of an event.
Its essence is to use maximum likelihood estimation to fit the parameters after assuming the data obey a certain distribution. As a generalized linear model, logistic regression builds on linear regression and introduces a nonlinear factor through the sigmoid function, which makes it well suited to binary classification. Logistic regression models transform the recommendation problem into a CTR (click-through rate) estimation problem. The recommendation process is roughly as follows. First, each feature is converted into a numerical feature vector and the optimization objective is determined (take optimizing the click rate as an example). The model is trained on the existing sample data to determine the internal parameters of the logistic regression model. In the model-serving stage, feature vectors are input, and the probability that the user will "click" an item is obtained from the model. Finally, a recommendation list is produced by ranking all items by their predicted click probability. The emphasis is on model training and online inference using the feature vectors of samples.

First, the sigmoid function (also called the logistic function):

$$f(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(w\cdot x + b)}} \qquad (15)$$

Figure 2: Example of the sigmoid function

Figure 2 shows the graph of the function. The feature vector x is input, and each feature is assigned a corresponding weight w to represent its relative importance. The features are summed by weight to obtain $x^T w$, which is passed through the sigmoid function to map it into the range 0 to 1, yielding the final "click rate". Gradient descent is a common training method for logistic regression models. It is a first-order optimization algorithm, also known as the method of steepest descent, whose purpose is to find a local minimum of a function: take the derivative of the objective function to get the direction of the gradient, step in the opposite direction of the gradient, and iterate until a local minimum is found.
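The training loop described above can be sketched as follows; this is a minimal batch-gradient-descent illustration, and the toy data, learning rate, and epoch count are assumptions, not the paper's experimental settings:

```python
import numpy as np

def sigmoid(z):
    """Equation (15): maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_lr(X, y, gamma=0.5, epochs=1000):
    """Batch gradient descent on the logistic log-likelihood."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / m   # gradient of the loss
        w -= gamma * grad                        # step against the gradient
    return w

# Hypothetical toy data: a bias column plus one feature that separates
# clicks (y=1) from non-clicks (y=0).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = train_lr(X, y)
print(sigmoid(X @ w))  # predicted click probabilities, low for y=0, high for y=1
```

Ranking items by these predicted probabilities yields the recommendation list described above.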
Given an input sample x, the probabilities of a positive and a negative sample are:

$$P(y=1\mid x;w) = f_w(x), \qquad P(y=0\mid x;w) = 1 - f_w(x) \qquad (16)$$

merged: $P(y\mid x;w) = \big(f_w(x)\big)^y \big(1-f_w(x)\big)^{1-y}$

The objective function is obtained from the maximum likelihood principle:

$$L(w) = \prod_{i=1}^{m} P\big(y^{i}\mid x^{i};w\big) \qquad (17)$$

Taking the partial derivative with respect to each parameter after transformation:

$$\frac{\partial}{\partial w_j} J(w) = \frac{1}{m}\sum_{i=1}^{m}\big(f_w(x^{i}) - y^{i}\big)x_j^{i} \qquad (18)$$

$$w_j \leftarrow w_j - \gamma\,\frac{1}{m}\sum_{i=1}^{m}\big(f_w(x^{i}) - y^{i}\big)x_j^{i} \qquad (19)$$

Finally, let us analyze the advantages of the logistic regression model. It is well suited to integrating different kinds of features into a comprehensive recommendation result. Second, as a generalized linear model, it has the support of a clear numerical interpretation. Moreover, the model is highly interpretable, which effectively reduces communication cost. It also meets engineering needs: it is easy to parallelize, the model is simple, and the training cost is low, so this model long occupied the mainstream of the engineering field. As a basic model, logistic regression is simple, intuitive, and easy to operate, but its limitations are also very obvious. Its expressive power is not strong, and it cannot carry out relatively advanced operations such as feature crossing and feature selection, inevitably losing information. To solve these problems, recommendation systems continued to develop in the direction of greater complexity, giving rise to the Field-aware Factorization Machine (FFM) and other high-dimensional models.

4. Poly2, FM, and FFM
4.1. Poly2
Mapping features with Poly2 can effectively capture the information of feature conjunctions; Poly2 performs feature crossing for the recommendation system. The model is:

$$\phi_{\text{Poly2}}(w,x) = \sum_{j_1=1}^{n}\sum_{j_2=j_1+1}^{n} w_{h(j_1,j_2)}\, x_{j_1} x_{j_2} \qquad (20)$$

where $w_{h(j_1,j_2)}$ is the weight of the feature pair and $x_{j_1}$, $x_{j_2}$ are features. However, Poly2 has some disadvantages.
First, when people deal with data from the Internet, they use one-hot encoding most of the time. However, one-hot encoding makes the feature vector very sparse. One-hot encoding is a way to turn a categorical feature into a vector. For example, Wednesday may be represented by [0 0 1 0 0 0 0] (the first place represents Monday, the second Tuesday, and so on). Thus, if we have many features, most of the dimensions will be zero. Second, with Poly2 the number of weights grows as $O(n^2)$, since every possible pair of features gets its own weight, which makes Poly2 impractical when n is very large. FM solves this problem and is more practical.

4.2. FM
To solve the problems of Poly2, FM was proposed in 2010 by Rendle [5]. The formula is:

$$\phi_{\text{FM}}(w,x) = \sum_{j_1=1}^{n}\sum_{j_2=j_1+1}^{n} \big(w_{j_1}\cdot w_{j_2}\big)\, x_{j_1} x_{j_2} \qquad (21)$$

FM learns a latent vector for every feature, very similar to matrix factorization. Each latent vector contains k factors, where k is a user-specified parameter. FM reduces the number of parameters, which is $O(n^2)$ in Poly2, to nk (with n ≫ k), so it is more practical for sparse feature vectors. FFM is an updated version of FM: instead of one latent vector, FFM uses a group of latent vectors for each feature, introducing the concept of field-awareness to improve the model.

4.3. FFM
The idea of FFM originates from PITF, a recommendation model for personalized tags, in which three interaction variables were assumed. In FFM, each feature has several latent vectors; depending on the field of the other feature, one of them is used for the inner product, as shown in the equation:

$$\phi_{\text{FFM}}(w,x) = \sum_{j_1=1}^{n}\sum_{j_2=j_1+1}^{n} \big(w_{j_1,f_2}\cdot w_{j_2,f_1}\big)\, x_{j_1} x_{j_2} \qquad (22)$$

$f_1$ and $f_2$ are the fields of features $j_1$ and $j_2$, respectively; the number of parameters is governed by n, k, and the number of fields f. Usually $k_{\text{FFM}} \ll k_{\text{FM}}$, and the computational complexity is $O(n^2 k)$.

4.4.
Experiments
We chose FM as the first algorithm to try. The main reason is that FM sits in the middle of the whole development of recommendation systems: before FM there were methods such as Logistic Regression (LR) and Support Vector Machines (SVM); after FM, the field-aware variant FFM, DeepFM (which combines FM with deep learning), and Gradient Boosting Decision Tree (GBDT) with LR brought recommendation quality to another level. In our view, FM is a good starting point for diving into recommendation systems, and the similarity between FM and embeddings is another reason for the choice. FM is designed for, and well known for, CTR (click-through rate) prediction. First, we looked for a dataset and found an anime dataset [6] that includes each anime's attributes, such as name, genre, episodes, members, type, and average rating, together with users' ratings of the anime. Here we use only the anime's genre as features and its average rating as FM's label, in order to simulate the CTR setting in which only item features are given. Moreover, we convert the rating to a binary value: ratings over 7.0 are set to 1 and those below 7.0 to 0. Second, we clean the data by one-hot encoding the genre field. Because every anime can have more than one genre, we extract the genre names and map each anime's genres to 1 in the corresponding genre-name fields. After data munging is done, we divide the data into training and testing sets by KFold with shuffling. For training, we use an efficient package called xLearn, which already implements FM.

Figure 3: Example of the real-valued anime feature vectors from the dataset. Every row represents the feature vector of a specific anime with its corresponding target $y^{(n)}$, here whether it was clicked or not.
The first row (blue) represents an anime and its features (genres); the columns (orange) represent indicator variables for each genre. The rightmost column is the target, here whether the item was clicked. Figure 3 shows the feature-vector pattern from the dataset [6]. The result of this training was not very satisfying: however we varied the hyperparameters, the accuracy and the other evaluation metrics barely changed. We realized that the features in this dataset are too few and too simple for the model to improve itself, so training usually hit early stopping within the first epochs. Instead of applying FFM to this dataset, we searched for another dataset, from KKBox [7], to which we applied both FFM and FM. We use the xLearn package [8], which focuses on fast FM, FFM, and LR training. In the process, we found that FFM is sensitive to the number of epochs, the number of features, and k, the dimension of the latent vectors. The result is impressive when k equals 4 with 70,000 features and 14 fields. The whole training process can be divided into two independent phases, data cleaning and training: first we clean up the data and convert the feature and field data into integers, which then serve as indices of the latent vectors. The data formats that the FM and FFM models accept are libsvm [9,10] and libffm [11], respectively. After cleaning the data, we use xLearn to train and try as many hyperparameter settings as we can to find the best ones for our dataset.

Figure 4: FM's and FFM's accuracy on test data
Figure 5: FM's and FFM's AUC (Area Under the Curve) on test data

Figures 4 and 5 show that FFM has better accuracy and AUC than FM, and that its accuracy grows faster than FM's when the training-data ratio is around 1-20%.

Figure 6: FM's and FFM's logloss on test data

Figure 6 shows that the loss of FFM is smaller and decreases faster than that of FM.

5. Extensions of Logistic Regression
5.1. GBDT+LR
FM and FFM improve model accuracy by multiplying pairs of features with latent vectors.
However, in reality the combination of more than two features sometimes determines the output. FM and FFM fall short in this respect because, once they take triple feature combinations into account, the space complexity increases to an intolerable extent. To solve this problem, Facebook in 2014 proposed the GBDT+LR method for predicting click-through rate (CTR), which fits certain combinations of features into the model while avoiding combinatorial explosion [12]. The procedure is divided into two independent phases. In Phase I, the training data are fed into a Gradient Boosting Decision Tree (GBDT). The boosting algorithm generates several trees, each fitted to the residual between the previous trees' predictions and the real outputs. After training the GBDT, however, its predictions are not what matters; the trees themselves play the significant role. The training data are fed again into the trained trees, and a feature is generated by one-hot encoding the leaf each sample finally falls into. For example, if each tree has 64 leaves and a piece of data falls into the 43rd leaf, it generates a one-hot encoding with a 1 in the 43rd position and 0s everywhere else. This 64-dimensional feature, concatenated with the features generated from the other trees, becomes the transformed feature of this piece of data. In Phase II, the transformed features of all data are fed into a logistic regression model, which is trained and yields the final predictions. GBDT+LR works well because of its efficiency in combining features, which is mainly the benefit of using GBDT to transform the feature vectors: each transformed feature represents not just one feature but a set of determinant features coming from the nodes of a tree, because when GBDT builds its trees it automatically selects features that work well together.
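The Phase-I leaf-index transformation described above can be sketched as follows. The two stub "trees" and their thresholds below are purely hypothetical stand-ins (real GBDT trees are learned from data); the sketch only illustrates how leaf indices become one-hot features:

```python
import numpy as np

def leaf_one_hot(leaf_index, n_leaves):
    """One-hot vector with a 1 at the leaf the sample falls into."""
    v = np.zeros(n_leaves)
    v[leaf_index] = 1.0
    return v

# Two hypothetical trees, each mapping a feature vector to a leaf index.
def tree1(x):  # 2 leaves
    return 0 if x[0] < 0.5 else 1

def tree2(x):  # 3 leaves
    return 0 if x[1] < 10 else (1 if x[0] < 0.5 else 2)

def transform(x):
    """Phase-I transformation: concatenate one leaf one-hot per tree."""
    return np.concatenate([leaf_one_hot(tree1(x), 2),
                           leaf_one_hot(tree2(x), 3)])

x = np.array([0.7, 12.0])
print(transform(x))  # -> [0. 1. 0. 0. 1.], the vector fed to LR in Phase II
```

Each active position marks the conjunction of all split conditions along one root-to-leaf path, which is how a single transformed feature encodes a combination of several raw features.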
Thus, although these features are treated linearly by the logistic regression, each of them is in fact meaningful as a combination of several features. Figure 7 shows the AUC and F1 curves of the GBDT+LR model on test samples with respect to the percentage of training samples. The graph indicates that the performance of this model on the dataset stabilizes around 0.65 (F1) and 0.67 (AUC) once enough data are used to avoid underfitting.

Figure 7: GBDT+LR's AUC (Area Under the Curve) and F1 for different training-sample sizes

5.2. LS-PLM
The Large Scale Piece-wise Linear Model (LS-PLM), or Mixed Logistic Regression (MLR), is another effective model, once used by Alibaba for its recommendation system. Though the model is still linear, it separates the training data into clusters and applies logistic regression to each cluster [13]. LS-PLM consists of two steps. In the first, a function σ divides the feature space into m parts. Then, within each part, an ordinary logistic regression η is applied to the features to produce a prediction. The overall output is the weighted sum of the predictions over the subspaces:

$$p(y=1\mid x) = g\left(\sum_{j=1}^{m} \sigma\big(u_j^T x\big)\, \eta\big(w_j^T x\big)\right) \qquad (23)$$

LS-PLM is effective because, when the training data have obvious group divisions, ordinary logistic regression treats them all in the same way and loses useful information. MLR, on the other hand, fits the training data in each group separately and uses different parameters to build the model. The hyperparameter m is free, and the optimal choice is determined by experiment: Alibaba found their optimal m to be 12, while on the dataset of this paper the optimal m is approximately 10. Figure 8 shows the AUC curve of the LS-PLM model on test samples with respect to the percentage of training samples. The graph shows that the performance of this model is approximately 0.73 (AUC) after sufficient training data are fed in, slightly better than GBDT+LR.

Figure 8: LS-PLM's AUC (Area Under the Curve) for different training-sample sizes

5.3.
Kernel LR
Sometimes the subspaces between different clusters of training data are hard to separate with LS-PLM. A combination of the kernel method and logistic regression can make the boundaries easier to draw. In LS-PLM, the space-division function σ is usually linear. In reality, however, the training data may not have boundaries describable in terms of linear variables: for example, a dataset of 2-dimensional features may have a hyperbola as its boundary, and the equation of a hyperbola cannot be captured by the ordinary linear methods of LS-PLM. Thus, kernel functions are introduced to solve this problem. A kernel function raises the dimension from n (the number of features) to some larger m; the additional dimensions are functions of the original dimensions and allow the boundary to be written as a linear combination of them. In the same example, one of the additional dimensions may be $\frac{x^2}{a^2} - \frac{y^2}{b^2}$, so that the boundary between the clusters of data becomes a plane again in the higher-dimensional space.

6. Challenges in Recommendation Systems
In the application of recommendation systems, user behavior is mostly observational, not experimental. This introduces a lot of bias; if these deviations are not considered and the data are fitted blindly, serious problems will follow.

Figure 9: Bias types caused by different processes

Figure 9 shows the feedback mechanism of the recommendation system, which can be divided into the following processes. Information about the interactions between users and commodities is collected from users. The collected interactions are expressed as a feedback matrix between users and commodities. Generally, there are two kinds of feedback: implicit (1 if there is an interaction, 0 otherwise) and explicit (users' scores on commodities).
The recommendation model is learned from the collected data, and relative preference probabilities are predicted from the historical data; the recommendation results are then fed back to the user, a process that influences the user's behavior and decisions. One of the challenges of recommendation systems is bias, which arises from several factors. First, user behavior is observational rather than experimental: users generate behavior based on the items they are exposed to, so the observational data are confounded by the way the recommendation system exposes items. Second, some items are more popular than others, which causes the recommendation system to receive more feedback on popular items; as a result, those popular items have a disproportionate impact on model training. Third, since the recommendation system has a feedback loop, it intensifies the bias and makes "the rich get richer." Usually, users only rate the items they care most about, positively (like) or negatively (dislike), and they tend to rate particularly good or bad items. This causes selection bias: selection bias exists when users decide for themselves which items to rate, so randomness is missing. Conformity bias also comes with rating items: people tend to rate similarly to others in the same group, even against their own ideas and judgments. An item with a higher score will receive more high-rating comments than it deserves, making the better one look better and the worse one worse; thus the ratings do not show people's true preferences. Position bias describes the tendency of users to notice and interact with certain positions in the recommendation list with higher probability, especially when there are many items on the list; moreover, users tend to trust the first few items on the list more. This, too, fails to reflect the real preferences of users. Those are the biases caused by users.
However, biases also exist in the recommendation system itself. Popularity bias causes popular products to be recommended even more often than their popularity warrants. In most cases, a small fraction of the item population accounts for most of the interactions with users, which is the long-tail phenomenon.

7. Conclusions and Future Work

In this paper, we mainly compare the efficiency of four traditional methods used in recommender systems. The results show that, for this kind of dataset, FFM outperforms GBDT+LR, LS-PLM, and FM in terms of loss and accuracy. Regarding future work: the performance of GBDT+LR in our experiments is worse than that of the other models. The reason is probably that one-hot encoding the leaf nodes (feature sets) generated by GBDT makes the space complexity grow rapidly (the fields expand to about 10,000), so the LR stage takes too long to train. In addition, the extreme sparsity caused by one-hot encoding may also lower the accuracy. As for FM and FFM, each takes only about 2.5 minutes to train on 1.8M entries, with each epoch taking 1.6 seconds on average, and FFM performs better than the other two methods (GBDT+LR and LS-PLM).

8. Acknowledgements

Throughout the writing of this paper I have received a great deal of support and assistance. I would first like to thank my supervisor, Professor David Woodruff, whose expertise was invaluable in formulating the research questions and methodology; your insightful feedback pushed me to sharpen my thinking and brought my work to a higher level. I would also like to acknowledge my teaching assistant, Hudson Li, from my Algorithms for Big Data course: thank you for your patient support and for all of the opportunities and advice I was given to further my research. I hereby express gratitude to my dear partner Shuyi Wang, without whose effort this work could not have been accomplished.
During its preparation, he made great contributions to data collection and algorithm implementation, and he completed Chapter 4 by himself. This work is a common achievement rather than a private possession. I would also like to acknowledge my teammates and classmates Ziyang Liu and Linyan Li from the Algorithms for Big Data course for their wonderful collaboration.

9. References

[1] Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61-70.
[2] Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001, April). Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (pp. 285-295).
[3] Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37.
[4] Netflix update: Try this at home. (n.d.). Retrieved February 28, 2022, from https://sifter.org/simon/journal/20061211.html
[5] Rendle, S. (2010, December). Factorization machines. In 2010 IEEE International Conference on Data Mining (pp. 995-1000). IEEE.
[6] CooperUnion. (2016, December 21). Anime recommendations database. Kaggle. Retrieved December 7, 2021, from https://www.kaggle.com/CooperUnion/anime-recommendations-database.
[7] KKBOX. (n.d.). WSDM - KKBOX's Music Recommendation Challenge. Kaggle. Retrieved December 7, 2021, from https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data.
[8] Aksnzhy. (n.d.). Aksnzhy/xlearn: High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface. GitHub. Retrieved December 7, 2021, from https://github.com/aksnzhy/xlearn.
[9] Chang, C. C., & Lin, C. J. (2011). LIBSVM: a library for support vector machines.
ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 1-27.
[10] Rendle, S. (2010, December). Factorization machines. In 2010 IEEE International Conference on Data Mining (pp. 995-1000). IEEE.
[11] Juan, Y., Zhuang, Y., Chin, W. S., & Lin, C. J. (2016, September). Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems (pp. 43-50).
[12] He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., ... & Candela, J. Q. (2014, August). Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising (pp. 1-9).
[13] Gai, K., Zhu, X., Li, H., Liu, K., & Wang, Z. (2017). Learning piece-wise linear models from large scale data for ad click prediction. arXiv preprint arXiv:1704.05194.