=Paper=
{{Paper
|id=Vol-1917/paper13
|storemode=property
|title=Interpretable Matrix Factorization with Stochastic Nonnegative DEDICOM
|pdfUrl=https://ceur-ws.org/Vol-1917/paper13.pdf
|volume=Vol-1917
|authors=Rafet Sifa,Cesar Ojeda,Kostadin Cvejoski,Christian Bauckhage
|dblpUrl=https://dblp.org/rec/conf/lwa/SifaOCB17
}}
==Interpretable Matrix Factorization with Stochastic Nonnegative DEDICOM==
Interpretable Matrix Factorization with Stochasticity Constrained Nonnegative DEDICOM

Rafet Sifa (1,2), Cesar Ojeda (1), Kostadin Cvejoski (1), and Christian Bauckhage (1,2)

(1) Fraunhofer IAIS, Sankt Augustin, Germany
(2) University of Bonn, Bonn, Germany
www.multimedia-pattern-recognition.info
{name.surname}@iais.fraunhofer.de

Abstract. Decomposition into Directed Components (DEDICOM) is a special matrix factorization technique that factorizes a given asymmetric similarity matrix into a combination of a loading matrix, describing the latent structures in the data, and an asymmetric affinity matrix, encoding the relationships between the found latent structures. Finding DEDICOM factors can be cast as a matrix norm minimization problem that requires alternating least squares updates to find appropriate factors. Yet, due to the way DEDICOM reconstructs the data, unconstrained factors might yield results that are difficult to interpret. In this paper we derive a projection-free, gradient descent based alternating least squares algorithm to calculate constrained DEDICOM factors. Our algorithm constrains the loading matrix to be column-stochastic and the affinity matrix to be nonnegative for more interpretable low rank representations. Additionally, unlike most of the available approximate solutions for finding the loading matrix, our approach takes all occurrences of the loading matrix into account to assure convergence. We evaluate our algorithm on a behavioral dataset containing pairwise asymmetric associations between a variety of game titles from an online platform.

Keywords: Unsupervised Learning, Matrix Factorization, Constrained Optimization, Behavior Analysis

Copyright © 2017 by the paper's authors. Copying permitted only for private and academic purposes. In: M. Leyer (Ed.): Proceedings of the LWDA 2017 Workshops: KDML, FGWM, IR, and FGDB. Rostock, Germany, 11.-13. September 2017, published at http://ceur-ws.org

1 Introduction

Matrix and tensor factorization methods have been widely used in data science applications, mostly for understanding hidden patterns in datasets and for learning representations for a variety of prediction tasks and recommender systems [1, 5, 10, 11, 13, 16, 17]. Decomposition into Directed Components (DEDICOM) [8] is a matrix and tensor factorization technique and a compact way of finding low rank representations of asymmetric similarity data. The method has found applications in various data science problems, including the analysis of temporal graphs [1], first order customer migration [15], natural language processing [4], spatio-temporal representation learning [2, 16] and link prediction in relational and knowledge graphs [13], where the addressed problems were primarily about analyzing asymmetric relationships between predefined entities.

Fig. 1: Illustration of two-way DEDICOM to partition a given similarity matrix $S \in \mathbb{R}^{n \times n}$, encoding pairwise asymmetric relations among $n$ objects, into a combination of a loading matrix $A \in \mathbb{R}^{n \times k}$ and a square affinity matrix $R \in \mathbb{R}^{k \times k}$. In this representation $A$ contains the latent factors, whose columns describe the hidden structures in $S$, and the relations among these structures are encoded by the asymmetric affinity matrix $R$. A variety of constrained versions of DEDICOM have previously been studied to improve the interpretability and the performance of finding appropriate factors.

Formally, given an asymmetric similarity matrix $S \in \mathbb{R}^{n \times n}$ encoding pairwise asymmetric relations (i.e. $s_{ij} \neq s_{ji}$) among $n$ objects, two-way DEDICOM finds a loading matrix $A \in \mathbb{R}^{n \times k}$ and a square affinity matrix $R \in \mathbb{R}^{k \times k}$ for representing $S$ as

$$S \approx A R A^T. \qquad (1)$$

More precisely, the DEDICOM approximation of the asymmetric association between elements $i$ and $j$ (i.e. $s_{ij}$) can be formulated in vector notation as

$$s_{ij} \approx \boldsymbol{a}_{i:}^T R\, \boldsymbol{a}_{j:} = \sum_{b=1}^{k} a_{ib}\, \boldsymbol{r}_{b:}^T \boldsymbol{a}_{j:} = \sum_{b=1}^{k} \sum_{c=1}^{k} a_{ib}\, r_{bc}\, a_{jc}, \qquad (2)$$

where $\boldsymbol{a}_{i:}$ and $\boldsymbol{a}_{j:}$ represent the $i$th and $j$th rows of $A$, respectively, and $\boldsymbol{r}_{b:}$ represents the $b$th row of $R$. In this representation $A$ contains the latent factors, whose columns describe the hidden structures in $S$, and the relations among these structures are encoded by the asymmetric affinity matrix $R$.
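For concreteness, the following minimal NumPy sketch (our own illustration with random data, not part of the paper) reconstructs a similarity matrix as in (1) and checks the elementwise formulation (2):

```python
import numpy as np

n, k = 6, 3
rng = np.random.default_rng(0)
A = rng.random((n, k))   # loading matrix, n x k
R = rng.random((k, k))   # asymmetric affinity matrix, k x k

# Three-factor reconstruction of the similarity matrix, Eq. (1)
S_hat = A @ R @ A.T

# Elementwise reconstruction of a single association, Eq. (2)
i, j = 1, 4
s_ij = sum(A[i, b] * R[b, c] * A[j, c] for b in range(k) for c in range(k))
assert np.isclose(S_hat[i, j], s_ij)
```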
A pictorial illustration of two-way DEDICOM factorization is shown in Fig. 1.

Considering the three-factor approximation of the similarity values as in (1) and (2), the interpretation (especially with respect to scale) of the hidden structures in given factor matrices $A$ and $R$ might yield misleading results in the presence of negative values [15]. That is, the interpretation of the results is limited to considering only the nonnegative affinities in $R$ or the positive or negative loadings of the corresponding points in (2). Additionally, when analyzing nonnegative data matrices (such as ones containing probabilities or counts), it is usually beneficial to consider nonnegative factor matrices, which can serve as condensed (or compressed) representations for informed decision making and representation learning [1, 8, 15, 16].

In this work we address the issue of interpretability of the DEDICOM factors by introducing a converging algorithm as an alternative to the previously proposed methods in [1, 13, 15, 16]. Our method constrains the loading matrix $A$ to contain column-stochastic vectors and (similar to [15, 16]) the affinity matrix $R$ to be nonnegative. Formally, this amounts to factorizing a given asymmetric similarity matrix $S$ as in (1) while enforcing

$$r_{ij} \geq 0 \quad \forall\, \{i, j\} \qquad (3)$$

and

$$a_{cb} \geq 0 \;\wedge\; \sum_{q=1}^{n} a_{qb} = 1 \quad \forall\, \{c, b\}. \qquad (4)$$

Compared to the additive-scaling based representation introduced in [15, 16], constraining the columns of $A$ to contain probabilities has the direct advantage that each element of $A$ can be interpreted as the representativeness value of the particular loading it belongs to. That is, the value of $a_{ib}$ represents how much the $i$th element in the dataset contributes to the $b$th latent factor. This is indeed parallel to the idea of Archetypal Analysis [3, 6, 17], where the mixture matrices respectively determine how much a data point contributes to constructing the archetypes and how much the archetypes contribute to reconstructing each data point.
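As a small illustration of this interpretability (a sketch with made-up numbers and item names, not data from the paper), a column-stochastic $A$ makes factor memberships directly readable, since each column sums to one and its entries can be ranked as contribution scores:

```python
import numpy as np

# Hypothetical column-stochastic loading matrix for n = 4 items, k = 2 factors;
# entry a_ib is item i's contribution to latent factor b, and each column sums to 1.
A = np.array([[0.70, 0.05],
              [0.20, 0.05],
              [0.05, 0.10],
              [0.05, 0.80]])
assert np.allclose(A.sum(axis=0), 1.0)

items = ["item_0", "item_1", "item_2", "item_3"]
for b in range(A.shape[1]):
    top = np.argsort(A[:, b])[::-1][:2]          # two most representative items
    print(f"factor {b}:", [(items[i], A[i, b]) for i in top])
```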
In the following we first describe the general alternating least squares optimization framework and give an overview of the algorithms for finding appropriate constrained and unconstrained DEDICOM factors. After that we introduce our algorithm by first studying our problem setting and then deriving the algorithm step by step. This is followed by a case study covering a real-world application, in which we analyze game-ownership patterns from data containing asymmetric associations between games using the factors extracted by our algorithm. Finally, we conclude our paper with a summary and some directions for future work.

2 Algorithms for Finding DEDICOM Factors

Finding appropriate DEDICOM factors can be cast as a matrix norm minimization problem with the objective

$$E(A, R) = \left\| S - A R A^T \right\|^2, \qquad (5)$$

which is minimized with respect to $A$ and $R$. Non-convex minimization problems of this kind usually follow an iterative alternating least squares (ALS) procedure, where at each iteration we optimize over a selected factor while treating the other factors as fixed. It is important to note that the objective function in (5) is convex in $R$ for fixed $A$ but not convex in $A$ for fixed $R$, which leads to optimal and sub-optimal approximate solutions when updating $R$ and $A$, respectively.

Starting with the update rules for the affinity matrix $R$, we note that for fixed $A$, minimizing (5) is a matrix regression problem [1, 13, 15] with the globally optimal solution

$$R \leftarrow A^{\dagger} S \left(A^T\right)^{\dagger}, \qquad (6)$$

where $A^{\dagger}$ denotes the Moore-Penrose pseudoinverse of $A$. Furthermore, if $A$ is constrained to be column-orthogonal (i.e. $A^T A = I_k$, where $I_k$ is the $k \times k$ identity matrix), the update for $R$ simplifies to $R \leftarrow A^T S A$ [15]. These updates are not constrained to a particular domain and might result in affinity matrices containing negative elements.

Constraining $R$ to contain only nonnegative values when minimizing (5) can be cast as a nonnegative least squares problem by transforming the arguments of (5) as

$$E(R) = \left\| \mathrm{vec}(S) - \mathrm{vec}(A R A^T) \right\|^2, \qquad (7)$$

where the vec operator vectorizes (or flattens) a matrix $B \in \mathbb{R}^{m \times n}$ as

$$\mathrm{vec}(B) = [b_{11}, \ldots, b_{m1}, \ldots, b_{1n}, \ldots, b_{mn}]^T. \qquad (8)$$

Since vectorizing (5) as in (7) does not affect the value of the norm [15], we can make use of the property that a vectorized product of matrices can be represented as a matrix-vector multiplication to rewrite (7) as

$$E(R) = \left\| \mathrm{vec}(S) - (A \otimes A)\, \mathrm{vec}(R) \right\|^2, \qquad (9)$$

where $\otimes$ represents the Kronecker product. As noted in [15], the expression in (9) can be mapped to a constrained linear regression problem [12] with a converging solution (through an active set algorithm), which defines the update

$$R \leftarrow \operatorname*{argmin}_{R} \left\| \mathrm{vec}(S) - (A \otimes A)\, \mathrm{vec}(R) \right\|^2 \;\; \text{s.t.} \;\; r_{ij} \geq 0 \;\; \forall\, i, j. \qquad (10)$$

Having identified alternating least squares updates for the constrained and unconstrained affinity matrix $R$, we now turn our attention to updates for the loading matrix $A$. Two different directions have been followed previously: approximating the minimization of (5) by treating particular occurrences of $A$ as constant [1, 13], and directly minimizing (5) by means of gradient based approaches. The former method stacks the matrices $S$ and $S^T$ as

$$\left[\, S \;\; S^T \,\right] = \left[\, A R A^T \;\; A R^T A^T \,\right] = A \left[\, R A^T \;\; R^T A^T \,\right], \qquad (11)$$

and solves for $A$ keeping $A^T$ fixed [1, 13] to arrive at the update

$$A \leftarrow \left( S A R^T + S^T A R \right) \left( R A^T A R^T + R^T A^T A R \right)^{-1}. \qquad (12)$$
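The following NumPy/SciPy sketch illustrates these three updates (our own illustration, not the authors' implementation; `scipy.optimize.nnls` stands in for the active-set solver of [12], and a pseudoinverse guards against a singular normal matrix in (12)):

```python
import numpy as np
from scipy.optimize import nnls

def update_R_unconstrained(S, A):
    """Globally optimal regression update, Eq. (6): R <- A^+ S (A^T)^+."""
    Ap = np.linalg.pinv(A)
    return Ap @ S @ Ap.T

def update_R_nonnegative(S, A):
    """Nonnegative least squares update, Eq. (10), via vec(A R A^T) = (A (x) A) vec(R)."""
    K = np.kron(A, A)                      # (n^2) x (k^2) design matrix
    r, _ = nnls(K, S.flatten(order="F"))   # column-major vec matches the Kronecker identity
    k = A.shape[1]
    return r.reshape((k, k), order="F")

def update_A_stacked(S, A, R):
    """Approximate update for A, Eq. (12), treating the trailing A^T as fixed."""
    num = S @ A @ R.T + S.T @ A @ R
    den = R @ A.T @ A @ R.T + R.T @ A.T @ A @ R
    return num @ np.linalg.pinv(den)       # pinv instead of inv for robustness
```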
The latter method's ALS update for the loading matrix $A$ is based on directly minimizing (5) by first-order gradient based optimization [15, 16], which at each update moves the current solution for $A$ in the direction opposite to the gradient of (5). To derive the gradient matrix, we start by expressing (5) in terms of the trace operator as

$$\begin{aligned} E(A, R) &= \mathrm{tr}\left[ S^T S - S^T A R A^T - A R^T A^T S + A R^T A^T A R A^T \right] \\ &= \mathrm{tr}\left[ S^T S - 2 S^T A R A^T + A R^T A^T A R A^T \right] \\ &= \mathrm{tr}\left[ S^T S \right] - 2\, \mathrm{tr}\left[ S^T A R A^T \right] + \mathrm{tr}\left[ A R^T A^T A R A^T \right]. \end{aligned} \qquad (13)$$

Since traces are linear mappings and the term $\mathrm{tr}\left[ S^T S \right]$ does not depend on $A$, minimizing $E(A, R)$ in (13) with respect to $A$ is equivalent to minimizing $E(A)$, which we define as

$$E(A) = E_1(A) + E_2(A) \qquad (14)$$

given that

$$E_1(A) = -2\, \mathrm{tr}\left[ S^T A R A^T \right] \qquad (15)$$

and

$$E_2(A) = \mathrm{tr}\left[ A R^T A^T A R A^T \right]. \qquad (16)$$

Under the orthogonality constraint on $A$, the term in (16) becomes independent of $A$. That is, since traces are invariant under cyclic permutation, we can reformulate (16) as

$$E_2(A) = \mathrm{tr}\left[ A R^T A^T A R A^T \right] = \mathrm{tr}\left[ R^T A^T A R A^T A \right] = \mathrm{tr}\left[ R^T R \right], \qquad (17)$$

which allows us to consider only the minimization of the error term in (15) [15, 16]. Consequently, we can define a gradient descent based update by considering the gradient matrix

$$\frac{\partial E_1(A)}{\partial A} = -2 \left( S^T A R + S A R^T \right), \qquad (18)$$

which with a learning rate $\eta_A$ allows us to define an update for $A$ as

$$A \leftarrow A + 2 \eta_A \left( S^T A R + S A R^T \right). \qquad (19)$$

As a final step, in order to ensure the orthogonality constraint on $A$, we project the updated $A$ by considering its QR decomposition [15, 16], $A = QT$, where $Q \in \mathbb{R}^{n \times k}$ is column-orthogonal and $T \in \mathbb{R}^{k \times k}$ is invertible upper triangular, and applying the update $A \leftarrow Q$. This concludes our overview of the methods for finding appropriate factors. For more detailed information about the alternating least squares solutions for DEDICOM factorization of matrices and tensors, we refer the reader to [1, 13, 15, 16].

3 Stochasticity Constrained Nonnegative DEDICOM

In this section we present a new ALS algorithm that considers DEDICOM factorization with column-stochastic loadings and nonnegative affinities for interpretability of the resulting factors. To begin with, as noted in [15, 16], we can assure nonnegative affinities $R$ by solving the nonnegative least squares problem introduced in [12], as in (10).

Next, considering the theoretical properties of constraining the columns of $A$ as in (4), we note that each column resides in the standard simplex $\Delta^{n-1}$, which is the convex hull of the standard basis vectors of $\mathbb{R}^n$. Formally, given that $\delta_{ij}$ represents the Kronecker delta and $V = \{ \boldsymbol{v}_1, \boldsymbol{v}_2, \ldots, \boldsymbol{v}_n \mid \boldsymbol{v}_i = [\delta_{i1}, \ldots, \delta_{in}]^T \}$ is the set of standard basis vectors of $\mathbb{R}^n$, the standard simplex $\Delta^{n-1}$ is defined as the compact set

$$\Delta^{n-1} = \left\{ \sum_{i=1}^{n} \alpha_i \boldsymbol{v}_i \;\middle|\; \sum_{i=1}^{n} \alpha_i = 1 \;\wedge\; \alpha_i \geq 0 \;\; \forall\, i \in [1, \ldots, n] \right\}. \qquad (20)$$

This indicates that finding an appropriate column-stochastic matrix $A$ minimizing the objective function in (5) reduces to a constrained optimization problem over the convex compact simplex $\Delta^{n-1}$.

Constrained optimization problems of this kind can be tackled using the Frank-Wolfe algorithm [7, 9], also known as the conditional gradient method [9]. The main idea behind the Frank-Wolfe algorithm is to minimize a differentiable convex function over a compact convex domain using iterative first-order Taylor approximations until convergence. Formally, given a differentiable convex function $f : \mathcal{S} \to \mathbb{R}$ over a compact convex set $\mathcal{S}$, the Frank-Wolfe algorithm aims to solve

$$\min_{x \in \mathcal{S}} f(x) \qquad (21)$$

by iteratively solving for

$$s_t = \operatorname*{argmin}_{s \in \mathcal{S}} \; s^T \nabla f(x_t), \qquad (22)$$

where $\nabla f(x_t)$ is the gradient of the optimized function $f$ evaluated at the current solution $x_t$. Following that, at each iteration $t$ the algorithm estimates the new solution by evaluating

$$x_{t+1} = x_t + \alpha_t (s_t - x_t), \qquad (23)$$

where $\alpha_t \in (0, 1]$ is the learning rate that is usually selected by performing a line search on (23) or set to decrease monotonically as a function of $t$ [7, 9].
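Since the feasible set here is the standard simplex, the linear subproblem (22) is solved by a vertex, i.e. the standard basis vector at the coordinate where the gradient is smallest. A minimal sketch of such a Frank-Wolfe loop (our own toy objective and helper name `frank_wolfe_simplex`, not code from the paper):

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, t_max=200):
    """Frank-Wolfe over the standard simplex: the subproblem (22) is solved
    by the vertex e_i with i = argmin_i [grad(x)]_i."""
    x = x0.copy()
    for t in range(t_max):
        g = grad(x)
        i = np.argmin(g)               # minimizing vertex of the simplex
        s = np.zeros_like(x)
        s[i] = 1.0
        alpha = 2.0 / (t + 2.0)        # monotonically decreasing step size
        x += alpha * (s - x)           # Eq. (23); iterate stays in the simplex
    return x

# Toy usage: project a point y onto the simplex by minimizing ||x - y||^2.
y = np.array([0.9, 0.4, -0.2])
x = frank_wolfe_simplex(lambda x: 2.0 * (x - y), np.full(3, 1.0 / 3.0))
print(x, x.sum())                      # nonnegative entries, sums to 1
```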
Additionally, one particular advantage of this optimization method is that it allows for an ε-approximation within a given maximum number of iterations $t_{\max}$ [7, 9]. In our special case, since the set of vertices of the standard simplex is exactly the set of standard basis vectors $V$ of $\mathbb{R}^n$ [3], at each iteration the Frank-Wolfe algorithm moves the current solution in the direction of one of the standard basis vectors until convergence.

Following that, since the stochasticity constraint does not allow for a simplification as in (17), we require the full derivation of the gradient matrix $\frac{\partial E(A)}{\partial A}$ from (13) so as to adapt the Frank-Wolfe algorithm to finding appropriate stochastic columns of $A$. To this end we start by writing $E_2(A)$ from (16) as a scalar summation,

$$\mathrm{tr}\left[ A R^T A^T A R A^T \right] = \sum_{u} \sum_{v} \sum_{w} \sum_{x} \sum_{y} \sum_{z} a_{uv}\, r_{wv}\, a_{xw}\, a_{xy}\, r_{yz}\, a_{uz}, \qquad (24)$$

and consider its partial derivative with respect to an arbitrary element $a_{ij}$ of $A$, which, by linearity of differentiation, can be written as

$$\frac{\partial E_2(A)}{\partial a_{ij}} = \sum_{u} \sum_{v} \sum_{w} \sum_{x} \sum_{y} \sum_{z} \frac{\partial}{\partial a_{ij}} \left( a_{uv}\, r_{wv}\, a_{xw}\, a_{xy}\, r_{yz}\, a_{uz} \right). \qquad (25)$$

The expression in (25) can be further expanded by the product rule as

$$\begin{aligned} \frac{\partial E_2(A)}{\partial a_{ij}} = &\sum_{u,v,w,x,y,z} \frac{\partial}{\partial a_{ij}} (a_{uv} r_{wv}) \left( a_{xw} a_{xy} r_{yz} a_{uz} \right) + \sum_{u,v,w,x,y,z} \frac{\partial}{\partial a_{ij}} (a_{xw}) \left( a_{uv} r_{wv} a_{xy} r_{yz} a_{uz} \right) \\ + &\sum_{u,v,w,x,y,z} \frac{\partial}{\partial a_{ij}} (a_{xy}) \left( a_{uv} r_{wv} a_{xw} r_{yz} a_{uz} \right) + \sum_{u,v,w,x,y,z} \frac{\partial}{\partial a_{ij}} (a_{uz} r_{yz}) \left( a_{uv} r_{wv} a_{xw} a_{xy} \right), \end{aligned} \qquad (26)$$

and evaluating the derivatives (each one fixing the corresponding indices to $i$ and $j$) results in

$$\frac{\partial E_2(A)}{\partial a_{ij}} = \sum_{w,x,y,z} r_{wj}\, a_{xw} a_{xy} r_{yz} a_{iz} + \sum_{u,v,y,z} a_{uv} r_{jv}\, a_{iy} r_{yz} a_{uz} + \sum_{u,v,w,z} a_{uv} r_{wv}\, a_{iw} r_{jz} a_{uz} + \sum_{v,w,x,y} r_{yj}\, a_{iv} r_{wv} a_{xw} a_{xy}. \qquad (27)$$

Finally, reformulating the four summations in (27) in terms of matrix multiplications as

$$\frac{\partial E_2(A)}{\partial a_{ij}} = 2 \left[ A R^T A^T A R \right]_{ij} + 2 \left[ A R A^T A R^T \right]_{ij} \qquad (28)$$

allows us to define the gradient matrix $\frac{\partial E_2(A)}{\partial A}$ as

$$\frac{\partial E_2(A)}{\partial A} = 2 \left( A R^T A^T A R + A R A^T A R^T \right). \qquad (29)$$

Combining the results from (18) and (29), the full gradient matrix of the minimized error norm with respect to $A$ is

$$\frac{\partial E(A)}{\partial A} = 2 \left( A R^T A^T A R + A R A^T A R^T - S^T A R - S A R^T \right), \qquad (30)$$

which allows us to define a column-wise Frank-Wolfe algorithm to come up with an optimal column-stochastic loading matrix $A$ minimizing (5). We list the essential steps of our adaptation of the projection-free Frank-Wolfe algorithm in Alg. 1.

  Randomly initialize valid A and R
  Let V = {v_1, v_2, ..., v_n | v_i = [δ_i1, ..., δ_in]^T}
  while stopping condition is not satisfied do
      // Update A with the Frank-Wolfe procedure
      t ← 0
      while t ≠ t_max and the updates of A are not small do
          // Define the gradient matrix, Eq. (30)
          G = ∂E(A)/∂A = 2(A R^T A^T A R + A R A^T A R^T − S^T A R − S A R^T)
          // Define the monotonically decreasing learning rate
          α ← 2/(t + 2)
          for b ∈ {1, ..., k} do
              // Find the minimizing simplex vertex
              i = argmin_l g_lb
              // Update column a_b of A
              a_b ← a_b + α(v_i − a_b)
          // Update the iterator
          t ← t + 1
      // Update R by solving the nonnegative least squares problem, Eq. (10)
      R ← argmin_R ||vec(S) − (A ⊗ A) vec(R)||^2  s.t.  r_ij ≥ 0 ∀ i, j

Alg. 1: A Frank-Wolfe based projection-free ALS algorithm to find semi-nonnegative DEDICOM factors with a column-stochastic loading matrix A and a nonnegative affinity matrix R. At each iteration of the algorithm, we alternate between finding an optimal column-stochastic A keeping R fixed and finding a nonnegative matrix R keeping A fixed. Note that the learning rate here is set to decrease monotonically.
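Below is a compact NumPy/SciPy sketch of Alg. 1 under simplifying assumptions (fixed iteration counts instead of the stopping tests; the names `stochastic_dedicom` and `grad_E` are ours, and this is an illustration rather than the authors' reference implementation):

```python
import numpy as np
from scipy.optimize import nnls

def grad_E(A, R, S):
    """Full gradient of the error norm w.r.t. A, Eq. (30)."""
    return 2 * (A @ R.T @ A.T @ A @ R + A @ R @ A.T @ A @ R.T
                - S.T @ A @ R - S @ A @ R.T)

def stochastic_dedicom(S, k, als_iters=30, t_max=100, seed=0):
    """Sketch of Alg. 1: ALS alternating Frank-Wolfe updates for a
    column-stochastic A with NNLS updates for a nonnegative R."""
    n = S.shape[0]
    rng = np.random.default_rng(seed)
    A = rng.random((n, k))
    A /= A.sum(axis=0)                          # valid column-stochastic init
    R = rng.random((k, k))                      # valid nonnegative init
    for _ in range(als_iters):
        # Frank-Wolfe updates for the columns of A, keeping R fixed
        for t in range(t_max):
            G = grad_E(A, R, S)
            alpha = 2.0 / (t + 2.0)             # monotonically decreasing rate
            for b in range(k):
                i = np.argmin(G[:, b])          # minimizing simplex vertex
                v = np.zeros(n)
                v[i] = 1.0
                A[:, b] += alpha * (v - A[:, b])  # Eq. (23), columnwise
        # Nonnegative least squares update for R, keeping A fixed, Eq. (10)
        r, _ = nnls(np.kron(A, A), S.flatten(order="F"))
        R = r.reshape((k, k), order="F")
    return A, R
```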
In a nutshell, our algorithm starts by randomly initializing (valid) matrices A and R, continues by alternating between the Frank-Wolfe optimization updates to come up with an optimal column-stochastic A and the alternating least squares updates for a nonnegative R, and stops once a predefined stopping condition is met. The stopping conditions are usually selected based on reaching a maximum number of alternating least squares updates (not to be confused with the maximum iteration count $t_{\max}$ of the Frank-Wolfe updates for the columns of A in Alg. 1), observing only minor subsequent changes in the minimized objective (in our case the matrix norm in (5)), or a combination of both conditions.

4 A Business Intelligence Application: Analyzing Asymmetric Game Ownership Associations in an Online Gaming Platform

In order to evaluate our algorithm, we consider a use case from game analytics [14, 16], in which we analyze the asymmetric relationships among various types of games. To this end we consider a dataset from an online gaming platform called Steam, which hosts thousands of games of various genres for a large player base whose size ranges from 9 to 15 million (daily active) players (see http://store.steampowered.com/stats/). We used the dataset from [14], which contains game ownership information of more than six million users for 3007 titles.

Representing the ownership information as a bipartite matrix $Y \in \{0, 1\}^{m \times n}$ indicating the ownership information of $m$ players for $n$ games, we construct the asymmetric association among games in terms of their pairwise empirical conditional probabilities. To this end, we define the directional similarity from game $i$ to game $j$ as

$$s_{ij} = \frac{\left| \{ c \mid y_{ci} \neq 0 \} \cap \{ b \mid y_{bj} \neq 0 \} \right|}{\left| \{ c \mid y_{ci} \neq 0 \} \right|}, \qquad (31)$$

where $| \cdot |$ indicates set cardinality and $y_{pq}$ indicates whether the $p$th player owns the $q$th game. It is important to note that (31) is inherently asymmetric and a special case of the so-called Tversky index [18] with prototype and variant weights of 1 and 0, respectively (or vice versa). Since the number of computations required to obtain the similarity matrix $S$ scales as $O(m n^2)$, we parallelized the procedure of computing the similarity values in (31) by utilizing a hybrid parallelization technique that simultaneously exploits distributed and shared memory architectures in order to speed up the computation.
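Assuming a binary ownership matrix Y as described above (rows: players, columns: games), the empirical conditional probabilities in (31) can also be computed with matrix operations instead of explicit set intersections; a minimal vectorized sketch (the helper name `directional_similarity` is ours, and every game is assumed to have at least one owner):

```python
import numpy as np

def directional_similarity(Y):
    """Empirical conditional probability matrix, Eq. (31):
    s_ij = |owners of game i who also own game j| / |owners of game i|."""
    Y = Y.astype(float)
    co = Y.T @ Y                      # co[i, j] = number of common owners of games i and j
    owners = np.diag(co).copy()       # co[i, i] = number of owners of game i
    return co / owners[:, None]       # divide row i by |owners of game i|

# Toy usage: 5 players, 3 games.
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])
S = directional_similarity(Y)
print(S[0, 1], S[1, 0])   # P(own game 1 | own game 0) vs. P(own game 0 | own game 1)
```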
Having obtained the asymmetric similarity matrix S containing the empirical conditional probability values, we factorize it using our algorithm and present the results for k = 3. To explicitly avoid modeling self-loops, at each alternating least squares iteration we replaced the diagonal of S with the diagonal of its current reconstruction [1]. In Fig. 2 we show the residual sum of squares, which decreases monotonically and converges to a minimum after 30 iterations.

Fig. 2: Illustration of the evolution of the Residual Sum of Squares (RSS) for factorizing our empirical conditional probability matrix indicating asymmetric relations between games using DEDICOM with k = 3. Our provably converging algorithm monotonically decreases the RSS at each iteration. (Plot: RSS, from roughly 270 down to 200, over 30 ALS iterations.)

Analyzing the resulting factors, we observe that the first two columns (or modes) a1 and a2 of A are sparse, whereas the last column is dense in terms of non-zero probability values, and the most dominant games vary from column to column. In Tbl. 1 we present the games with high loadings in their corresponding columns and observe that mostly the free-to-play flagship game Team Fortress 2 (TF 2), followed by the FPS games CS: Source and Red Orchestra: Ostfront 41-45, contributes to a1. On the other hand, the indie-platformer games Wooden Sen'SeY (WSS), Ethan (Meteor Hunter) and Oozi: Earth Adventure contribute mostly to a2. Finally, the mostly dense a3 has very high loadings for all of the flagship games of the analyzed platform and other AAA games, including Left 4 Dead 2, Dota 2, Skyrim, Half-Life 2 (HL 2) and Garry's Mod.

  Loadings | Identifier       | Dominant Games
  a1       | TF 2/FPS         | TF 2, CS: Source and Red Orchestra: Ostfront 41-45
  a2       | Indie/Platformer | WSS, Ethan and Oozi: Earth Adventure
  a3       | Flagships/AAA    | Left 4 Dead 2, Dota 2, Skyrim, HL 2, Garry's Mod

Table 1: Selection of games with high loading values for the analyzed dataset.

Analyzing the resulting row-normalized affinity matrix, which encodes the asymmetric relations between the modes (Fig. 3), the first row indicates that TF 2 players tend more towards the indie-platformer mode than towards the AAA games, owing to the fact that TF 2 is mostly a singular game that people primarily prefer on the Steam platform [14]. On the other hand, in line with the results from [14], the opposite directional associations, namely from the flagships to TF 2, are high, which is related to the fact that the majority of the players playing one or more of the flagship games also prefer TF 2 [14]. Finally, analyzing the second mode, similar to the results in [15], the self-association is the highest association we observe (indicating players preferring mostly to stay in the same genre); this is followed by high associations to TF 2 and to the mode related to the flagship and AAA games.

                   | TF2/FPS | Indie/Platform | Flagships/AAA
  TF2/FPS          |  0.2469 |         0.5158 |        0.2372
  Indie/Platform   |  0.3637 |         0.4867 |        0.1496
  Flagships/AAA    |  0.6623 |         0.1042 |        0.2334

Fig. 3: Results of factorizing the empirical conditional probability graph created from the pairwise game ownership information of more than six million players of an online gaming platform. The normalized nonnegative affinity matrix R shows the relationships between the groups automatically determined by the DEDICOM algorithm.

It is worth mentioning that, parallel to the results in [1], by increasing the number of modes k for the loading matrix A, we observed more detailed partitions of the modeled patterns. Running our algorithm with k = 2, we observed a higher reconstruction error; the indie-platformer mode a2 remained one of the factors, whereas the games contributing to a3 merged with those contributing to mode a1. On the other hand, considering the resulting factors when running our algorithm with k = 4, we obtained a slight improvement in the reconstruction error and observed that the modes a1 and a2 remained the same, while mode a3 split into two different modes: one containing the same games as a3, and a new mode containing additional indie and platformer games (reducing the probability mass on the games of a3) such as Finding Teddy, Gentlemen, Gravi and Face Noir.

5 Conclusion and Future Work

In this work we studied the DEDICOM model for factorizing asymmetric similarity matrices into a combination of a low rank loading matrix and an affinity matrix. We gave an overview of the theoretical details and the algorithms for finding appropriate factors.
In essence, the affinity matrix R indicates different directional structures that are determined by the columns of the loading matrix A. Having defined our objective function as the matrix norm of the difference between the factorized asymmetric similarity matrix and its DEDICOM reconstruction, we derived the gradient matrix of this function with respect to the loading matrix A and presented a variant of the Frank-Wolfe algorithm to obtain interpretable column-stochastic loadings and nonnegative affinities. Running our algorithm on a dataset containing asymmetric pairwise relationships between more than 3000 games, we found interesting patterns indicating directional preferences among games.

Our future work involves analyzing the performance of our model on different tasks, including link prediction and representation learning [13, 16]. Considering the tensor extensions in [13, 16], our future work also involves extending the proposed model to tensor factorizations, which would allow us to analyze asymmetric similarity matrices from different sources. In the context of business intelligence and game analytics, this might allow us to analyze and compare, for instance, the preferences of different countries or player groups.

References

1. Bader, B., Harshman, R., Kolda, T.: Temporal Analysis of Semantic Graphs using ASALSAN. In: Proc. of IEEE ICDM (2007)
2. Bauckhage, C., Sifa, R., Drachen, A., Thurau, C., Hadiji, F.: Beyond Heatmaps: Spatio-Temporal Clustering using Behavior-Based Partitioning of Game Levels. In: Proc. of IEEE CIG (2014)
3. Bauckhage, C.: k-Means Clustering via the Frank-Wolfe Algorithm. In: Proc. of KDML-LWDA (2016)
4. Chew, P.A., Bader, B.W., Rozovskaya, A.: Using DEDICOM for Completely Unsupervised Part-Of-Speech Tagging. In: Proc. Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics (2009)
5. Cremonesi, P., Koren, Y., Turrin, R.: Performance of Recommender Algorithms on Top-N Recommendation Tasks. In: Proc. ACM RecSys (2010)
6. Cutler, A., Breiman, L.: Archetypal Analysis. Technometrics 36(4), 338–347 (1994)
7. Frank, M., Wolfe, P.: An Algorithm for Quadratic Programming. Naval Research Logistics 3(1-2) (1956)
8. Harshman, R.A.: Models for Analysis of Asymmetrical Relationships among N Objects or Stimuli. In: Proc. Joint Meeting of the Psychometric Society and the Society for Mathematical Psychology (1978)
9. Jaggi, M.: Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In: Proc. of ICML (2013)
10. Kantor, P., Rokach, L., Ricci, F., Shapira, B.: Recommender Systems Handbook. Springer (2011)
11. Koren, Y., Bell, R., Volinsky, C.: Matrix Factorization Techniques for Recommender Systems. Computer 42(8) (2009)
12. Lawson, C.L., Hanson, R.J.: Solving Least Squares Problems. SIAM (1974)
13. Nickel, M., Tresp, V., Kriegel, H.: A Three-way Model for Collective Learning on Multi-relational Data. In: Proc. of ICML (2011)
14. Sifa, R., Drachen, A., Bauckhage, C.: Large-Scale Cross-Game Player Behavior Analysis on Steam. In: Proc. of AAAI AIIDE (2015)
15. Sifa, R., Ojeda, C., Bauckhage, C.: User Churn Migration Analysis with DEDICOM. In: Proc. of ACM RecSys (2015)
16. Sifa, R., Srikanth, S., Drachen, A., Ojeda, C., Bauckhage, C.: Predicting Retention in Sandbox Games with Tensor Factorization-based Representation Learning. In: Proc. of IEEE CIG (2016)
17. Sifa, R., Bauckhage, C., Drachen, A.: Archetypal Game Recommender Systems. In: Proc. of KDML-LWA (2014)
18. Tversky, A.: Features of Similarity. Psychological Review 84(4) (1977)